1.0 INTRODUCTION


1.1 Background

The advent of redundant and highly reliable airborne digital systems has
raised a number of critical issues in connection with the ability of such
systems to detect, isolate and recover from hardware faults. In such systems
fault detection is a critical factor in achieving system reliability. The
present study is essentially an investigation of the nature of faults and the
dynamics of fault propagation and detection in digital systems.

Most airborne systems, present and projected, employ comparison-monitoring,
self-test or a combination of both techniques to achieve the requisite
detection and isolation capability. One of the problems of fault detection by
either technique is that a fault may not manifest itself in a
comparison-monitored variable or at an accessible output of a component until
the faulted component is exercised by a suitable combination of input or
internal state. As a consequence, the fault may not be detected by self-test
or, in the case of comparison-monitoring, the fault may remain latent for long
periods of time. Prolonged latency in a redundant system can reduce
survivability since such faults effectively increase the time-on-risk.

In an effort to determine the dynamics of fault propagation and detection
in a digital computer, NASA-Langley Research Center sponsored a pilot program
entitled "Modeling of a Latent Fault Detector in a Digital System" (ref. 1) in
1978. The objectives were to study how software reacts to a fault, to account
for as many variables as possible affecting detection and to forecast a given
software program's detecting ability prior to computation. A series of fault
injection experiments was conducted using an emulation of a small, idealized
processor with a very limited instruction set. The results of the study were
surprising since they contradicted the prevailing belief that most hardware
faults cause catastrophic computational errors. In fact, the study showed that
a significant proportion of faults remained latent after many repetitions of a
program. However interesting these results were, they were greeted with a
healthy skepticism. It was not clear, for instance, that similar results could
be obtained for a real processor, preferably one used in actual airborne
applications. As a consequence, it was decided to extend the study to include
a real avionics processor.


1.2 Objectives of the Study

The present study was based on the premise that a gate-level emulation of
an airborne avionics processor was available. Prior to award of contract the
Bendix Research Laboratories and the Bendix Flight Systems Division had
developed a gate-level emulation of the Bendix BDX-930 digital computer. This
computer is used in a number of flight control and avionics programs, notably
on the AFTI F-16 FBW system and SIFT. SIFT (Software Implemented Fault
Tolerance) is a fault tolerant digital computer system developed by SRI
International with Bendix Flight Systems Division as a major subcontractor. A
description of the BDX-930 and its emulation will be given in subsequent
sections.

Underlying the entire study was the intention to demonstrate that gate-level
emulation was a viable and practical tool for coverage measurement and failure
modes and effects analyses of digital systems. It was for this reason that so
much effort was expended in developing a fast and efficient emulator.
Admittedly, there are many gate-level emulations available that emulate to a
greater level of detail and, perhaps, even with greater fidelity than the one
used in the present study. An informal survey of such emulators indicated
that, except for hardware emulators, the run time was prohibitive, being on
the order of 500,000 to 1,000,000 times slower than the emulated processor.

As indicated previously, a primary objective of the present study is to 
ascertain whether and to what extent the results of the pilot study apply to a 
real avionics processor. Specifically, 

• Given a set of software programs ranging from a simple "fetch and store"
  to a complicated, multi-instruction algorithm, inject a single fault,
  selected at random, and observe the time to detection assuming that
  detection occurs whenever there is a difference between the computed
  outputs of the faulted and non-faulted processors executing the same
  program. Determine differences in detection time when faults are injected
  at the gate-level and component-level.

• Based upon derived empirical latency distributions, develop and validate
  a model of fault latency that will forecast a software program's detecting
  ability.

The following additional objectives were added to those of the pilot study: 

• Given a typical avionics self-test program, inject faults at both the
  gate-level and component-level and determine the proportion of faults
  detected.

• Determine why undetected faults were undetected.

• Recommend how the emulation of the BDX-930 can be extended to
  multi-processor systems such as SIFT.

• Determine the proportion of faults detected by a miniprocessor BIT
  (built-in-test program) irrespective of self-test.

1.3 Foreword

The authors would like to express their appreciation to: 

NASA-Langley Research Center who conceived and initiated the study; NASA Project 
Engineer Salvatore Bavuso whose advice and encouragement were indispensable and 
made the task a pleasant one; Bendix Research Laboratories who did most of the 
development of the emulator; Dr. Allen White of Kentron International for his 
critique of the statistical analyses; Prof. Mario Barbacci of Carnegie-Mellon
University for his advice and assistance in developing the emulator. 

Use of trade names of manufacturers in this report does not constitute an 
official endorsement of such products or manufacturers, either expressed or 
implied, by the National Aeronautics and Space Administration. 


2.0 SUMMARY


• A gate-level emulation of the Bendix BDX-930 digital computer was developed
  prior to the present study for the purpose of analyzing failure modes and
  effects in digital systems. The run time of the emulation was 25,000 times
  slower than the BDX-930 when hosted on a PDP-10.

• Six software programs were emulated and faults were injected at both the
  gate-level and pin-level (i.e., component-level). The resultant computed
  outputs were compared with those of a non-faulted computer executing the
  same program. A fault was considered detected when these outputs differed.
  The results showed that:

    • Most detected faults are detected in the first repetition. Subsequent
      repetitions do not appreciably increase the proportion of detected
      faults.

    • A large proportion of faults remained undetected after as many as 8
      repetitions of the program, e.g., 60% at the gate-level.

    • Component-level faults are easier to detect than gate-level faults.
      For example, after 8 repetitions, the proportions of undetected faults
      were


      PROGRAM      GATE-LEVEL    COMPONENT-LEVEL

      FETSTO         61.7%           35.5%
      FIB            58.2%           28%
      ADDSUB         59.5%           32.3%


• The results of the study corroborate the findings of the pilot study
  (ref. 1). This was surprising considering that the pilot study used an
  emulation of a very simple processor. As an illustration, the pilot study
  indicated that, after 8 repetitions, the proportions of undetected faults
  were 64.4%, 53.7% and 44.9% for FETSTO, FIB and ADDSUB, respectively.

• The Urn Model, for forecasting fault latency, produced distributions that 
were in close agreement with the empirical distributions. However, the 
rationale for the model should be analyzed further. 

• A self-test program of 2000 executable instructions was expressly designed
  for the study. The designer was given the single requirement that fault
  coverage should be at least 95%. The resultant test consisted of 241
  separate subtests for the purpose of exercising the entire instruction set
  of the BDX-930.

  The results indicated that there is a significant difference in coverage
  of gate-level versus component-level faults. For example,

      gate-level coverage      = 86.5%

      component-level coverage = 97.9%

• Only 48% of all detected faults were detected by a subtest. The remaining
  detected faults were detected because the first subtest was not computed.

• Most of the subtests were redundant, i.e., only 46 of the 241 subtests
  actually detected a fault.

• 62% of all detected faults were detected by the first 23 subtests.

• A large proportion of "don't care" (i.e., indistinguishable) faults were
  injected. These proved to be exceedingly difficult to identify.

• The micromemory PROM contained the largest proportion of undetected
  faults.

• The emulation can easily accommodate the SIFT system but with a 7-fold increase 
in run time. 



3.0 FAULT MODELLING AND SELECTION 


3.1 Fault Model 

At the present time there is little or no data available regarding either
the mode or frequency of failures of MSI or LSI devices. Despite this
deficiency of data, failure modes and effects analyses are regularly performed
for avionics and flight control systems (a typical analysis is described in
Section 11). The conventional approach is to assume a set of failure modes for
each device. These are usually restricted to faults at single pins although,
occasionally, multiple faults may be considered. In most cases the failure
rate of a device is assumed to be equally distributed over the pins or over
the set of postulated failure modes. Except for special devices, faults are
assumed to be static, being either S-a-0 (stuck-at-0) or S-a-1 (stuck-at-1).

The point to be made here is that failure modes and their rate of occurrence
are necessarily conjectural and the credibility of the present study suffers
no less from this deficiency of data than the conventional analysis. The
authors emphasize that the emulation approach does not solve this problem.

In the present study the following assumptions are made regarding failure
modes:

• Every device can be represented, from the standpoint of performance and
  failure modes, by the manufacturer-supplied, gate-level equivalent circuit.

• Every fault can be represented as either a S-a-0 or S-a-1 fault at a gate
  node.

• The failure rate of the device is equally distributed over the gates of
  the equivalent circuit.

• The failure rate of a gate is equally distributed over the nodes of the
  gate.

• S-a-0 and S-a-1 faults are equally likely. 

• Memory faults are exclusively faults of single bits. 

• A memory fault is the complement of its non-faulted state. 

Faults are injected into all devices except the main memory. In the case
of the microprogram memory, which is emulated at the functional level, faults
are injected into the memory cells where they remain active for the duration
of the test. Faults are injected at an input or output gate node, and also
remain active for the duration of the test. When a fault is injected at an
output node it is allowed to propagate to all nodes and devices that are
physically connected to the failed node. When a fault is injected at an input
node, it does not propagate back to the driving node. This strategy provides a
wider variety of failure modes than would otherwise be possible if propagation
were allowed. The fault model, although conjectural at the present time, can
be updated as fault data becomes available. The proposed model provides a
simple, automatic and consistent method of generating faults. The resultant
fault set includes a rich assortment of static and dynamic (i.e.,
data-dependent) faults.
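
The difference between output-node and input-node faults can be illustrated
with a minimal sketch. The three-gate circuit, the node names and the
`simulate` helper below are hypothetical and are not part of the BDX-930
emulation. They only show that an output-node S-a-1 fault reaches every gate
driven by the node, that an input-node fault affects a single gate input, and
that either fault stays latent for input combinations that do not exercise it.

```python
from itertools import product

# Hypothetical circuit (not from the report):
#   n1 = a AND b ;  out1 = n1 OR c ;  out2 = NOT n1
# A fault is a tuple (kind, site, stuck_value):
#   ("output", "n1", v)         n1 stuck at v, seen by every gate it drives
#   ("input", ("or", "n1"), v)  only the OR gate's n1 input is stuck at v


def simulate(a, b, c, fault=None):
    def pin(gate, name, value):
        # An input-node fault changes what one gate sees on one pin only.
        if fault and fault[0] == "input" and fault[1] == (gate, name):
            return fault[2]
        return value

    n1 = pin("and", "a", a) & pin("and", "b", b)
    # An output-node fault propagates to all nodes connected to n1.
    if fault and fault[0] == "output" and fault[1] == "n1":
        n1 = fault[2]
    out1 = pin("or", "n1", n1) | pin("or", "c", c)
    out2 = 1 - pin("not", "n1", n1)
    return out1, out2


def detecting_inputs(fault):
    """Input combinations whose faulted outputs differ from the good outputs."""
    return [v for v in product((0, 1), repeat=3)
            if simulate(*v) != simulate(*v, fault=fault)]


output_fault = ("output", "n1", 1)        # node n1 S-a-1
input_fault = ("input", ("or", "n1"), 1)  # OR-gate input S-a-1

print(len(detecting_inputs(output_fault)), "of 8 inputs expose the output fault")
print(len(detecting_inputs(input_fault)), "of 8 inputs expose the input fault")
```

In this toy circuit the output-node fault is exposed by more input
combinations than the input-node fault, because it also disturbs the inverter
driven by the same node.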


3.2 Method of Selecting Faults


The method of selecting faults is implicit in the fault model. Explicitly, 

• Each device is assigned a failure rate.

• The failure rate is equally distributed over the gates of the gate-level
  representation.

• The failure rate of each gate is equally distributed over the nodes of
  the gate.

• The failure rate of each node is equally distributed over S-a-0 and S-a-1
  faults.

• As a result of this procedure, each S-a-0 and S-a-1 fault is assigned a
  probability of occurrence equal to the prescribed failure rate. The
  resultant fault set is then randomly sampled with each fault weighted by
  its probability of occurrence. It is noted that, according to this
  procedure, faults in devices with high failure rates will be selected more
  frequently than faults in devices with lower failure rates.
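
The selection procedure above can be sketched as a weighted random sample. The
device names, failure rates and gate lists below are invented for
illustration, and `build_fault_set` and `sample_faults` are hypothetical
helpers rather than the routines used in the study. The sketch samples with
replacement, which is an assumption; the text does not say whether a fault
could be drawn more than once.

```python
import random


def build_fault_set(devices):
    """Split each device failure rate over gates, nodes and S-a-0/S-a-1.

    devices: {device: (failure_rate, {gate: [node, ...]})}
    Returns a list of ((device, gate, node, stuck_value), weight) pairs.
    """
    faults = []
    for dev, (rate, gates) in devices.items():
        per_gate = rate / len(gates)          # equal share per gate
        for gate, nodes in gates.items():
            per_node = per_gate / len(nodes)  # equal share per node
            for node in nodes:
                for stuck in (0, 1):          # S-a-0 and S-a-1 equally likely
                    faults.append(((dev, gate, node, stuck), per_node / 2))
    return faults


def sample_faults(faults, n, seed=None):
    """Draw n faults, each weighted by its probability of occurrence."""
    rng = random.Random(seed)
    population = [f for f, _ in faults]
    weights = [w for _, w in faults]
    return rng.choices(population, weights=weights, k=n)


# Hypothetical devices and failure rates (failures per hour).
devices = {
    "U1": (3.0e-6, {"g1": ["in1", "in2", "out"], "g2": ["in1", "out"]}),
    "U2": (1.0e-6, {"g1": ["in1", "out"]}),
}
faults = build_fault_set(devices)
for fault in sample_faults(faults, 5, seed=1):
    print(fault)
```

Because the weights of a device's faults sum to that device's failure rate,
faults in the hypothetical U1 are drawn about three times as often as faults
in U2.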

The above procedure does not distinguish between gate-level and component
(i.e., pin)-level faults except by probability of occurrence; the method
automatically assigns failure rates to pins. However, a different selection
procedure was employed for component-level faults. For these faults it was
assumed that:


• The failure rate of each device is equally distributed over the pins.

While this assumption violates the prescribed fault model it is consistent
with the conventional method of estimating fault detection coverage by
simulating faults in actual hardware. As a consequence, all component-level
detection estimates obtained in the study are estimates that would be obtained
by proponents of this approach.


4.0 DESCRIPTION OF EXPERIMENTS


4.1 Definition of Failure Detection 

In the present study, fault coverage and latency estimates are obtained by
employing two conventional techniques of failure detection:
comparison-monitoring and self-test.

In comparison-monitoring a set of computed variables is compared with a
corresponding set computed in another processor. If it is arranged that both
processors operate on identical inputs and are closely synchronized, then any
difference in a computed variable signifies that one of the processors has
failed. In practice each processor executes an algorithm which compares the
appropriate variables and signals a discrepancy when such exists. In the
present study this algorithm was omitted; a fault is considered to be detected
if a difference between corresponding variables exists irrespective of the
ability of either processor to recognize the difference or signal the
discrepancy. Thus, the fault coverage obtained from the study is somewhat more
optimistic than would be obtained in practice.

In self-test, on the other hand, each component of the processor is
exercised by a set of computations designed specifically to test that
component. The results of each computational set are compared with pre-stored
values and any difference signifies that the fault was detected. In practice,
and in the study, the processor increments a register after the successful
completion of each test and before proceeding to the next test. If the test is
not successful the program exits. After an interval of time equal to the
maximum time to complete the program, the contents of the counter are decoded.
If the value exactly equals the total number of tests, the fault was not
detected. Otherwise the fault was detected.
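
The counter-based decision just described can be sketched as follows. This is
an illustrative model only: the function names are hypothetical and each
subtest is represented by a callable that returns True when its computed
result equals the pre-stored value.

```python
def run_self_test(subtests):
    """Run subtests in order and return the count of completed tests."""
    counter = 0
    for subtest in subtests:
        if not subtest():   # result differs from the pre-stored value
            break           # the program exits without finishing
        counter += 1        # register incremented after each successful test
    return counter


def fault_detected(counter, total_tests):
    """Decode the counter after the maximum completion time has elapsed."""
    return counter != total_tests


# Hypothetical run: the third of four subtests fails.
subtests = [lambda: True, lambda: True, lambda: False, lambda: True]
count = run_self_test(subtests)
print(count, fault_detected(count, len(subtests)))
```

A faulted processor that never reaches the end of the program also leaves the
counter short of the total, so it is counted as detected by the same rule.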

It is emphasized that "failure detection", as it is used in the present
study, means almost exactly what it means in an actual airborne avionic system.
This is in marked contrast to the commonly employed alternate approach of
assuming that a failure is detected whenever the effect of the failure reaches
an accessible bus or register, even though the program may not be interrogating
these devices at that time.

In the following paragraphs a description is given of the actual
computations involved in the experiment with particular emphasis on the
explicit definition of "failure detection" in each instance.

4.2 Definition of Failure Detection Coverage 

We assume that a test procedure is given for detecting failures of a
component, C. Each failure mode of C will require a non-zero time for
detection. By considering all failures of C and all combinations of inputs and
internal states of C, we obtain in principle, if not in practice, a probability
density function for time-to-detect, which is measured from the onset of the
failure to the time of detection.

Denoting this density by $\mathrm{pdf}(\tau)$, where

      $\tau$ = time-to-detect = latency time,

we define test coverage as

1)   $1 - \alpha(\tau) = \int_0^{\tau} \mathrm{pdf}(x)\,dx$

     = probability of detecting a failure of C in the interval
       $0 < t < \tau$.

Observe that, according to this definition, test coverage is a function of
latency time. The definition can be extended to all devices of the computer as
follows:

Subdivide the computer into mutually exclusive components
$C_1, \ldots, C_k$ with failure rates $\lambda_1, \ldots, \lambda_k$ and test
coverages $1 - \alpha_1(\tau), \ldots, 1 - \alpha_k(\tau)$, respectively.

Set $\mathrm{pdf}_i(\tau)$ = probability density for time-to-detect failures
of $C_i$, $i = 1, \ldots, k$.

Then the pdf for all failures of the computer is

2)   $\mathrm{pdf}(\tau) = \sum_{i=1}^{k} \frac{\lambda_i}{\lambda}\,\mathrm{pdf}_i(\tau)$

where $\lambda = \lambda_1 + \lambda_2 + \cdots + \lambda_k$.

Test coverage of the whole computer is then

3)   $1 - \alpha(\tau) = \sum_{i=1}^{k} \frac{\lambda_i}{\lambda}\,(1 - \alpha_i(\tau))$.

The method of selecting faults, described in Section 3, is consistent 
with this definition. 

From (3) we obtain

4)   $\alpha(\tau) = \sum_{i=1}^{k} \frac{\lambda_i}{\lambda}\,\alpha_i(\tau)$

     = expected value of the $\alpha_i(\tau)$, weighted by relative failure
       rate.

One of the objectives of the present study is to obtain estimates of the
probability density function, $\mathrm{pdf}(\tau)$. These estimates are
presented in Section 5.
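
As a hedged illustration of how coverage as a function of latency could be
estimated from experimental data, the sketch below computes the empirical
fraction of faults detected by each repetition. The function name and the
sample latencies are hypothetical; the convention L = K for first detection at
repetition K and L = 0 for an undetected fault is the one used in the Phase I
procedures.

```python
def empirical_coverage(latencies, max_repetition):
    """Fraction of injected faults detected by repetition 1..max_repetition.

    latencies: one value L per injected fault, L = K if the fault was first
    detected at repetition K and L = 0 if it was never detected.
    """
    n = len(latencies)
    return [sum(1 for L in latencies if 0 < L <= k) / n
            for k in range(1, max_repetition + 1)]


# Invented latencies for eight injected faults (not data from the study).
latencies = [1, 1, 0, 3, 0, 1, 2, 0]
coverage = empirical_coverage(latencies, 8)
print(coverage)
```

The resulting list is non-decreasing in the repetition number and levels off
at the fraction of faults that were ever detected.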

4.3 Indistinguishable Faults and Effects on Coverage 

During the development of the emulator it became apparent that a
significant proportion of components had no effect whatsoever on the digital
process. For the most part, these components are associated with unused pins,
e.g., a complementary output of a flip-flop. However, there are other
components whose lack of effect is not as obvious as, for example, a component
that only affects the process when it is faulted. Certain micromemory bits are
in this category. In order to distinguish between these categories of faults
we are led to the following informal definitions:

      A fault that has no effect on the computational process is
      indistinguishable. All other faults are distinguishable.

We note that a distinguishable fault has the property that there exists a
software program the output of which differs from that of the same program
executed by an identical but non-faulted processor.

Effects on Coverage 

The presence of indistinguishable faults can lead to erroneous and
misleading estimates of coverage. In theory, indistinguishable faults should be
disqualified from the emulation or from the fault selection process. This is
consistent with the definition of coverage which implicitly assumes that
faults are distinguishable. Unfortunately, in order to disqualify
indistinguishable faults from the emulation or from the fault selection process
they must be first identified and this is a non-trivial task because of the
large number of possible faults. The approach taken in this study was to select
faults irrespective of their distinguishability properties and analyze only
those faults that were undetected by self-test. The proportion of
indistinguishable faults from this set was then used as an estimate over all
faults.

We now indicate, briefly, how indistinguishable faults affect coverage.

If

      $\gamma$ = proportion of components yielding indistinguishable faults

and

      $1 - \alpha$ = coverage of distinguishable faults

then

      $1 - \alpha$ = desired coverage

and

5)   $(1 - \alpha)(1 - \gamma)$ = coverage when indistinguishable faults are
     counted as undetected.

We note, incidentally, that

6)   $(1 - \alpha)(1 - \gamma) + \gamma$ = coverage when indistinguishable
     faults are counted as detected.

The estimate of (5) will be obtained if indistinguishable faults are not
disqualified. Then, coverage estimates will be in error by the factor
$1 - \gamma$.

In the more general case it may be more convenient to estimate the
proportion of indistinguishable faults by partition since the effect on
coverage is a function of the relative failure rate of the partition.

Let  $\lambda_i$ = failure rate of Partition #i, $i = 1, 2, \ldots, 6$.

     $\gamma_i$ = proportion of indistinguishable faults in Partition #i.

     $1 - \alpha_i$ = coverage of distinguishable faults in Partition #i.

     $\lambda = \lambda_1 + \lambda_2 + \cdots + \lambda_6$ = failure rate.

From the previous section, if all faults are distinguishable then coverage
is given by

7)   $1 - \alpha = \sum_{i=1}^{6} \frac{\lambda_i}{\lambda}\,(1 - \alpha_i)$.

If, however, indistinguishable faults are counted as undetected then the
coverage actually obtained is

8)   $1 - \alpha = \sum_{i=1}^{6} \frac{\lambda_i}{\lambda}\,(1 - \alpha_i)(1 - \gamma_i)$.

We note that, if indistinguishable faults are disqualified, the true
coverage is

9)   $1 - \alpha = \dfrac{\sum_{i=1}^{6} \lambda_i\,(1 - \alpha_i)(1 - \gamma_i)}{\sum_{i=1}^{6} (1 - \gamma_i)\,\lambda_i}$.

From (8) it can be seen that the required accuracy of an estimate of
$\gamma_i$ depends upon the relative failure rate, $\lambda_i/\lambda$. If
$\lambda_i$ is sufficiently small then the effect of an inaccurate estimate of
$\gamma_i$ is negligible.
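
A small numerical sketch may make the three coverage expressions concrete. The
function below evaluates, for a list of partitions, the failure-rate-weighted
coverage with all faults distinguishable, the coverage obtained when
indistinguishable faults are counted as undetected, and the coverage when they
are disqualified. The function name and the two example partitions are
hypothetical and are not data from the study.

```python
def coverage_estimates(partitions):
    """partitions: list of (lam_i, alpha_i, gamma_i) tuples.

    Returns (all_distinguishable, counted_as_undetected, disqualified).
    """
    lam = sum(l for l, _, _ in partitions)
    detected = sum(l * (1 - a) * (1 - g) for l, a, g in partitions)
    all_distinguishable = sum(l * (1 - a) for l, a, _ in partitions) / lam
    counted_as_undetected = detected / lam
    disqualified = detected / sum(l * (1 - g) for l, _, g in partitions)
    return all_distinguishable, counted_as_undetected, disqualified


# Invented example: (failure rate, alpha_i, gamma_i) for two partitions.
example = [(3.0, 0.1, 0.2), (1.0, 0.2, 0.5)]
e7, e8, e9 = coverage_estimates(example)
print(round(e7, 4), round(e8, 4), round(e9, 4))
```

With a single partition the second value reduces to
$(1 - \alpha)(1 - \gamma)$ and the third to $1 - \alpha$, which matches the
single-partition case discussed earlier in this section.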

4.4 Objectives of Experiments 

Most airborne systems, present and projected, employ comparison-monitoring,
self-test or a combination of both to achieve the requisite detection and
isolation capability. One of the problems of fault detection, by either method,
is that a fault may not manifest itself at either a comparison-monitored
variable or at an accessible output of self-test until the faulted component is
suitably exercised. As a consequence, faults can remain latent for long periods
of time. This is the significance of latency time, $\tau$, in the definition of
test coverage of Section 4.2.

One of the objectives of the experiments is to estimate $\tau$ for a variety
of computations including self-test. The Phase I experiments consist of six
software programs ranging from a simple fetch and store to a complicated
multi-instruction, linear convergence algorithm. Using comparison-monitoring,
the probability distribution of $\tau$ will be estimated for each of the six
programs and the interdependence of these distributions and the number and
type of instructions will be ascertained.

The Phase II experiments utilize a typical avionics system self-test program
which consists of 241 separate, sequential tests. The program consists of 2000
executed instructions which require an execution time of 3 milliseconds on the
BDX-930.



In practice, the only measure of failure detection coverage when applied to a
self-test program is whether or not the fault is detected before the normal
completion of the self-test program. In particular, latency time has little or
no significance in this context. Nevertheless, an equivalent latency time was
estimated for self-test by tabulating the number of the test that actually
detected the fault, the tests being executed in the order 1, 2, ..., 241.

A secondary objective of the experiment was to corroborate the results of
(ref. 1) which utilized the same Phase I programs executed on an idealized,
"very simple processor."


4.5 Phase I Experiments

This phase consisted of six programs each of which was coded in the assembly
language of the BDX-930. For the purpose of comparison with the experiments
performed in (ref. 1) the instructions of the BDX-930 were primarily restricted
to the following set:

LOAD 

STORE 

ADD 

SUBTRACT 

BRANCH 

In the following descriptions only the set of computations labelled
"compute" were performed by the target BDX-930 CPU; all other computations,
selections, comparisons, etc. were performed by the emulation host computer
Executive. Needless to say, there were no failures in these latter
computations.

When the non-failed processor completed a computation* and before the start
of the next computation, the Executive recomputed all initializing variables
and stored them in the appropriate locations of the scratchpad memory.

In the parallel mode of operation, when 36 computers are simultaneously
being emulated, the initializing variables are stored simultaneously in the 36
copies of the scratchpad memories.


* In the parallel mode of operation one of the emulated processors is
  non-faulted and, as a consequence, the end of its computation cycle can be
  determined from its program counter.


4.5.1 Fibonacci (FIB) 


a. Procedure 

T0) Select integers A, B at random from the interval

-2^ + 1 < X < 2^ -1 . 

For each fault:

T1) Preset the program counter to the address of the first instruction.

T2) Store A, B in successive locations of memory.

T3) Compute and store in successive locations of memory

          ...
          $S_8 = S_6 + S_7$.

T4) When the non-failed processor completes its last instruction compare
    $S_1, S_2, \ldots, S_8$, term by term, in both the non-failed and failed
    processors. If $S_K$ is the first variable to miscompare set L = K
    (L = latency period). If all $S_K$ compare (undetected failure),
    set L = 0.
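
The term-by-term comparison of step T4 can be sketched as follows. Several
details are assumptions made only for illustration: the sequence is seeded
with $S_1 = A$ and $S_2 = B$, and the fault is modelled as a single adder
output bit stuck at 0. All function names are hypothetical.

```python
def fib_terms(a, b, add, n=8):
    """Fibonacci-type sequence of n terms, seeded with a and b (assumed)."""
    terms = [a, b]
    while len(terms) < n:
        terms.append(add(terms[-2], terms[-1]))
    return terms


def good_add(x, y):
    return x + y


def stuck_bit_add(bit):
    """Adder whose output bit `bit` is S-a-0 (hypothetical fault)."""
    mask = 1 << bit

    def add(x, y):
        return (x + y) & ~mask

    return add


def latency(good, faulted):
    """L = K for the first miscomparing term (1-based), or 0 if all compare."""
    for k, (g, f) in enumerate(zip(good, faulted), start=1):
        if g != f:
            return k
    return 0


good = fib_terms(1, 1, good_add)
print(latency(good, fib_terms(1, 1, stuck_bit_add(3))))
print(latency(good, fib_terms(1, 1, stuck_bit_add(5))))
```

With these inputs the fault on bit 3 is not exposed until a sum first sets
that bit, while the fault on bit 5 is never exercised by the eight terms and
therefore remains latent.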

b. Instruction Set 


During a typical computation the following instructions were executed: 


INSTRUCTION 

LOAD 

STORE 

ADD 

BRANCH 

CLEAR 


4.5.2 Fetch and Store (FETSTO)

a. Procedure


T0) Select 8 integers, $A_K$, at random from the interval
-2^^ + 1 < A,^ < 2^® -1 . 


For each fault:

T1) Preset the program counter to the address of the first instruction.

T2) Store the $A_K$ in successive locations of memory.

T3) Compute:
    Fetch the $A_K$ and store in successive locations of memory.

T4) If the re-stored value of $A_K$ is denoted by $S_K$ then, when the
    non-failed processor completes its last instruction, compare
    $S_1, S_2, \ldots, S_8$ in both the non-failed and failed processors.
    If $S_K$ is the first variable to miscompare set L = K. If all $S_K$
    compare set L = 0.

b. Instruction Set

During a typical computation the following instructions were executed:

      INSTRUCTION    FREQUENCY

      LOAD               1
      STORE              2
      SUBTRACT           1
      BRANCH             2


4.5.3 Add and Subtract (ADDSUB)

a. Procedure

T0) Select 8 integers, $A_K$, at random from the interval

-2’^ + 1 < A|j < 2’^ - 1 . 


For each fault:

T1) Preset the program counter to the address of the first instruction.

T2) Store the $A_K$ in successive locations of memory.

T3) Compute and store in successive locations of memory:

          $S_1 = A_1 - A_2$
          $S_2 = A_1 + A_2$
          $S_3 = A_3 - A_4$
          $S_4 = A_3 + A_4$
          $S_5 = A_5 - A_6$
          $S_6 = A_5 + A_6$
          $S_7 = A_7 - A_8$
          $S_8 = A_7 + A_8$

T4) When the non-failed processor completes its last instruction compare
    $S_1, S_2, \ldots, S_8$, term by term, in both the non-failed and failed
    processors. If $S_K$ is the first variable to miscompare set L = K.
    If all $S_K$ compare set L = 0.

b. Instruction Set

During a typical computation the following instructions were executed:

      INSTRUCTION    FREQUENCY

      LOAD               2
      STORE              2
      ADD                2
      SUBTRACT           1
      BRANCH             2
      TRANSFER           2


4.5.4 Search and Compute (SERCOM) 

a. Procedure 

T0) Select 8 sets of integers, ($A_K$, $B_K$, $C_K$), at random, each
    component from the interval

          0 < X < 20.

For each fault:

T1) Preset the program counter to the address of the first instruction.

T2) Store the ($A_K$, $B_K$, $C_K$) in successive locations of memory.

T3) Compute and store in successive locations of memory


hk = ®k ^ S T 

^2k “ ®k ) 

^1k ' ®k * •'k ] 

^2k ’ ®k * *'k J 

^Ik ° ®k ■ ''k^*T)l 

^2k * ®k ^ *'k J 


If < A, 

1 f A|^ ^ Bj^ and < A^ 

if A|^ < and < Cj^ ^ 


*1 Multiplication is performed by successive addition. 

T4) When the non-failed processor completes its last instruction compare
    $S_{1K}$, $S_{2K}$ term by term, in both the non-failed and failed
    processors (*2). If $S_{1K}$ or $S_{2K}$ is the first variable to
    miscompare set L = K. If all $S_{1K}$, $S_{2K}$ compare set L = 0.

b. Instruction Set

During a typical computation the following instructions were executed:

      INSTRUCTION    FREQUENCY

      LOAD               6
      STORE              6
      ADD               17
      SUBTRACT           1
      BRANCH            24
      TRANSFER           5


4.5.5 Linear Convergence (LINCON) 

a. Procedure 

T0) Select the following integers from the indicated intervals:


M 

-8 < M <8 

0 , 

_= Q = 

Y 

-2^^ + 1 < Y„ < a’* - 1 

0 , 

= 0 = 

^1 * ^2 * • • • • * Xg 

0 < Xj^ < 2^\ 


Assume that $X_1 < X_2 < \cdots < X_8$.

For each fault:

T1) Preset the program counter to the address of the first instruction.

T2) Store $M_0$, $Y_0$, $X_1$, $X_2$, ..., $X_8$ in successive locations of
    memory.


*2 Although this program was written to utilize 8 sets of integers, this 
experiment was performed using only 1 set. 

T3) Compute $M_K$, $Y_K$, for K = 1, 2, ..., 8 as specified in the flow
    diagram of figure 1, and store in successive locations of memory.
    Note again that all multiplications are performed as successive
    additions.

T4) When the non-failed processor completes its last instruction compare
    $M_1$, $Y_1$, ..., $M_8$, $Y_8$, term by term, in both the non-failed and
    failed processors. If $M_K$ or $Y_K$ is the first variable to miscompare
    set L = K*. If all $M_K$, $Y_K$ compare set L = 0.

b. Instruction Set

During a typical computation the following instructions were executed:

      INSTRUCTION    FREQUENCY

      LOAD              38
      STORE             38
      ADD               16
      SUBTRACT           4
      BRANCH            39
      TRANSFER          11
      CLEAR              1


4.5.6 Quadratic (QUAD)

a. Procedure

T0) Select 8 sets of integers, ($A_K$, $C_K$, $X_K$), at random from the
    indicated intervals:

0 < X < 2^® - 1 


Xk, 


For each fault:

T1) Preset the program counter to the address of the first instruction.

T2) Store the selected integers in successive locations of memory.


* Although this program was written to utilize 8 values of X, this experiment
  was performed using only $X_1$.

T3) Compute and store in successive locations of memory (overflows are
    ignored):


Sk “ wh - hh - 

»(* 2 ) 


<=■ 1 . 2 . 


T4) When the non-failed processor completes its last instruction, compare
    $S_1, S_2, \ldots, S_8$, term by term, in both the non-failed and failed
    processors. If $S_K$ is the first variable to miscompare set L = K. If
    all $S_K$ compare set L = 0.


b. Instruction Set

During a typical computation the following instructions were executed:

      INSTRUCTION    FREQUENCY

      LOAD               7
      STORE              5
      ADD               30
      SUBTRACT           1
      BRANCH            38
      TRANSFER           6


*1 Multiplication is performed by successive addition.

*2 Although this program was written to utilize 8 sets of integers, this
   experiment was performed using only 4 sets.


4.6 Phase II Experiments

This phase consists of injecting faults and executing a typical avionic
flight control system self-test program to determine failure detection coverage.
The self-test program was written expressly for this study.

A flight control system may employ one or more self-test programs and in a
variety of ways: for example, as a background program, in pre-flight test, in
maintenance test or on-line to isolate a failed computer. While the present
study does not preclude any of these, the on-line application is the most
interesting and critical. For orientation purposes an on-line application of
self-test will be briefly described. The control system consists of three
identical digital computers each driving a command port of a triplex,
mechanically voted actuator. A first failure is detected via
comparison-monitoring and the offending computer is disengaged. The second
computer failure is also detected via comparison-monitoring and, upon
detection, each of the two computers executes a self-test program designed to
detect a specified proportion of failures in a specified period of time. The
inability to successfully complete the self-test program or an explicit
detection of the failure during this period of time causes the faulted
computer to be disengaged from the actuator, leaving the remaining computer in
control. The disengagement logic is independent of the CPU. Essentially,
disengage is "armed" after the first failure. Either of the remaining
computers can, thereafter, call for an interrupt to self-test if it detects a
miscomparison. This interrupt occurs in both computers and sets the program
counters to the first instruction of the self-test and activates a one-shot of
duration slightly in excess of the time it takes for a non-failed computer to
complete the self-test program. At any time until the one-shot times out the
self-test program may set a discrete output word whose value indicates whether
or not the failure was detected. This discrete word is decoded in hardware.


If the value corresponds to a predetermined value (which does not exist in
memory but must be computed) then the computer successfully completed self-test
and, of course, did not find the fault. If the value is negative the computer
is immediately disengaged. If, however, the word was not changed at all, having
been initialized to zero at the start of the self-test, the computer is
disengaged after the time-out if and only if the other computer successfully
passed its self-test.
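
The decoding rules above can be summarized as a small decision function. This
is only a sketch of the logic as described: the names Action, PASS_VALUE and
decode are hypothetical, the pass value itself is an arbitrary placeholder (the
report says only that it must be computed rather than read from memory), and
words that are positive but not equal to the pass value are not covered by the
description, so they are reported as unspecified.

```python
from enum import Enum


class Action(Enum):
    STAY_ENGAGED = "stay engaged"
    DISENGAGE_NOW = "disengage immediately"
    DISENGAGE_AT_TIMEOUT = "disengage after time-out"
    UNSPECIFIED = "not covered by the description"


PASS_VALUE = 0x1234  # hypothetical placeholder for the computed pass value


def decode(word, other_passed, pass_value=PASS_VALUE):
    """Map the discrete output word of one computer to a disengage action."""
    if word == pass_value:
        return Action.STAY_ENGAGED          # self-test completed, no fault found
    if word < 0:
        return Action.DISENGAGE_NOW         # explicit detection of the failure
    if word == 0:                           # word never changed from its initial value
        if other_passed:
            return Action.DISENGAGE_AT_TIMEOUT
        return Action.STAY_ENGAGED
    return Action.UNSPECIFIED
```

The zero case is the one that depends on the other computer: an unchanged word
leads to disengagement only when the other computer has passed.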

Before describing the fault injection procedure a brief overview of the 
self-test program will be given. 

The self-test program provides the option of selecting any one of 14 test
sequences, depending upon the coverage desired. The difference between these
test sequences is in the number of input and internal state combinations
employed in testing an instruction. A test procedure is specified by setting
N, in location ARG, to an integer value between 1 and 14, the complexity and
length of the tests increasing with increasing N. In the present study N = 1.

The resultant test procedure consisted of 241 separate tests. 

After a successful completion of a test the program increments register A_14
(A_14 is initialized to 0) and proceeds to the next test in the sequence. If,
however, a failure is detected the program skips the remaining tests and
transfers the contents of A_14 to a designated memory location, ANSW, whose
contents become the measure of failure detection. If the program successfully
completes all tests then ANSW = 241.
In the Phase II experiments a fault was defined as detected if, after a
complete execution of the self-test program by the non-failed processor,
ANSW ≠ 241 in the faulted processor. Observe that, according to this
definition, a fault is detected if the faulted processor jumps out of the
program, gets hung up in an infinite loop or executes a single extra
instruction before transferring the contents of A_14.


4.6.1 Self-Test 


For each fault: 

T1) Preset the program counter to the address of the first instruction

(i.e., to CPUT) and initialize the stack pointer (i.e., register A^^) 

to the starting location of the scratchpad memory. 

T2) Compute (i.e., execute self-test). 

T3) When the processor completes the equivalent number of microcycles,
corresponding to a complete execution of the self-test program by the
non-failed processor, halt. The fault is detected if and only if
ANSW ≠ 241.
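
The counter-and-ANSW mechanism and the detection criterion can be sketched as
follows. This is an illustrative model only, not the self-test program itself:
each test is represented by a hypothetical callable returning True on success,
and the function names are assumptions.

```python
N_TESTS = 241  # number of tests in the N = 1 sequence


def run_self_test(tests):
    """Run tests in order; return the value that would be transferred to ANSW."""
    counter = 0  # stands in for the register incremented after each passed test
    for test in tests:
        if not test():
            break  # a failed test skips the remaining tests
        counter += 1
    return counter


def fault_detected(answ, n_tests=N_TESTS):
    """The fault counts as detected if and only if ANSW differs from 241."""
    return answ != n_tests


healthy = [lambda: True] * N_TESTS
faulty = [lambda: True] * 9 + [lambda: False] + [lambda: True] * 231

answ_healthy = run_self_test(healthy)  # all tests pass
answ_faulty = run_self_test(faulty)    # tenth test fails, counter stops at 9
```

A processor that jumps out of the program never performs the transfer, so ANSW
keeps its initial value of 0 and fault_detected(0) is also True.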
5.0 RESULTS OF EXPERIMENTS


In this section the data from the experiments is presented concisely and 
with a minimum of commentary. A detailed analysis of the results is given in 
the next section. 


5.1 Distribution of Faults 

As indicated previously, the selection of faults was random, with each device
weighted in proportion to its failure rate. S-a-0 and S-a-1 faults were equally
weighted. The failure rates associated with each partition are given in Table 1.
Initially, 1,000 gate-level and 400 component-level faults were randomly
selected. Later, in order to reduce the cost of the runs, it was necessary to
reduce the number of faults actually injected. The number of faults finally
selected for each experiment is given in Table 2. A detailed breakdown of
the number of faults injected into each partition is given in Tables 3 and 4
for the Phase I and Phase II experiments, respectively. The numbers in
parentheses are the numbers of faults that should have been selected if the
sampling had been stratified over the partitions. Except for the Phase II
component-level faults, the indicated quantities include indistinguishable
faults.

The same set of 400 component-level faults was used in all Phase I
experiments. From these 400, 200 were randomly selected and used in Phase II
(i.e., SELF-TEST). The same 1,000 gate-level faults were used in FETSTO and
SERCOM. From these 1,000, 600 were randomly selected and used in ADDSUB, FIB,
QUAD and LINCON. From these 600, 300 were randomly selected and used in
Phase II.
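
The selection scheme can be sketched as failure-rate-weighted sampling followed
by nested random subsampling. The device names and rates below are made up for
illustration, the seed is arbitrary, and sampling with replacement is an
assumption (the report does not say how repeated selections were handled).

```python
import random


def select_faults(failure_rates, n, rng):
    """Draw n faults: device weighted by failure rate, polarity equally likely."""
    devices = list(failure_rates)
    weights = [failure_rates[d] for d in devices]
    chosen = rng.choices(devices, weights=weights, k=n)  # with replacement
    return [(d, rng.choice(("S-a-0", "S-a-1"))) for d in chosen]


rng = random.Random(1978)                    # fixed seed, for repeatability only
rates = {"U1": 7.0, "U2": 1.2, "U3": 45.0}   # hypothetical devices and rates

gate_1000 = select_faults(rates, 1000, rng)  # e.g. FETSTO, SERCOM
gate_600 = rng.sample(gate_1000, 600)        # e.g. ADDSUB, FIB, QUAD, LINCON
gate_300 = rng.sample(gate_600, 300)         # Phase II
```

Because each smaller set is drawn from the previous one, every fault used in
Phase II was also used in all of the Phase I gate-level experiments.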

5.2 Phase I Experiments 

5.2.1 FETSTO Experiment 

After each injected fault FETSTO was executed for 8 repetitions. The resultant
histograms of detected faults versus repetitions to detection are shown in
Figures 2a through 2i. Tabular results of the raw data are given in Table 5.


Figures 2a, 2b, Summarized

(Combined, S-a-1 and S-a-0 gate-level faults)

o 61.7% undetected after 8 repetitions.
o 29.9% detected in 1st repetition.
o 8.4% detected in repetitions 2 through 8.
o 59.8% of S-a-1 faults undetected.
o 63.6% of S-a-0 faults undetected.


Figures 2c, 2d, 2e, Summarized

(Combined gate-level faults by partition)

o 98% of faults in Partition #5 undetected.
o 96.3% of faults in Partition #6 undetected.

We note that Partition #5 contains the micromemory.


Figures 2f, 2g, Summarized

(Combined, S-a-1 and S-a-0 component-level faults)

o 35% undetected after 8 repetitions (compared with 61.7% for gate-level
  faults).
o 51.3% detected in 1st repetition.
o 13.2% detected in repetitions 2 through 8.
o 31% of S-a-1 faults undetected.
o 40.1% of S-a-0 faults undetected.


Figures 2h, 2i, Summarized

(Combined component-level faults by partition)

o Partitions #5 and #6 did not allow for component-level faults. Pin
  faults (i.e., component-level) were injected at adjacent partitions.

5.2.2 ADDSUB Experiment 

After each injected fault ADDSUB was executed for 8 repetitions. The resultant
histograms of detected faults versus repetitions to detection are shown in
Figures 3a through 3i. Tabular results of the raw data are given in Table 6.


Figures 3a, 3b, Summarized

(Combined, S-a-1 and S-a-0 gate-level faults)

o 59.5% undetected after 8 repetitions.
o 33.5% detected in 1st repetition.
o 7.0% detected in repetitions 2 through 8.
o 54.9% of S-a-1 faults undetected.
o 64.2% of S-a-0 faults undetected.

Figures 3c, 3d, 3e, Summarized

(Combined gate-level faults by partition)

o 99% of faults in Partition #5 undetected.
o 100% of faults in Partition #6 undetected.

Figures 3f, 3g, Summarized

(Combined, S-a-1 and S-a-0 component-level faults)

o 32.3% undetected after 8 repetitions (compared with 59.5% for gate-level
  faults).
o 57% detected in 1st repetition.
o 10.7% detected in repetitions 2 through 8.
o 27.6% of S-a-1 faults undetected.
o 37.1% of S-a-0 faults undetected.

Figures 3h, 3i, Summarized

(Combined component-level faults by partition)

o Partitions #5 and #6 did not allow for component-level faults. Pin
  faults were injected at adjacent partitions.

5.2.3 FIB Experiment

After each injected fault FIB was executed for 8 repetitions. The resultant
histograms of detected faults versus repetitions to detection are shown in
Figures 4a through 4i. Tabular results of the raw data are given in Table 7.


Figures 4a, 4b, Summarized

(Combined, S-a-1 and S-a-0 gate-level faults)

o 58.2% undetected after 8 repetitions.
o 35% detected in 1st repetition.
o 6.8% detected in repetitions 2 through 8.
o 54.3% of S-a-1 faults undetected.
o 62.2% of S-a-0 faults undetected.


Figures 4c, 4d, 4e, Summarized

(Combined gate-level faults by partition)

o 98% of faults in Partition #5 undetected.
o 100% of faults in Partition #6 undetected.


Figures 4f, 4g, Summarized

(Combined, S-a-1 and S-a-0 component-level faults)

o 28% undetected after 8 repetitions (compared with 58.2% for gate-level
  faults).
o 61.3% detected in 1st repetition.
o 10.7% detected in repetitions 2 through 8.
o 22.7% of S-a-1 faults undetected.
o 33.5% of S-a-0 faults undetected.

Figures 4h, 4i, Summarized

(Combined component-level faults by partition)

o Partitions #5 and #6 did not allow for component-level faults. Pin
  faults were injected at adjacent partitions.

5.2.4 QUAD Experiment 

After each injected fault QUAD was executed for 4 repetitions. The resultant
histograms of detected faults versus repetitions to detection are shown in
Figures 5a through 5i. Tabular results of the raw data are given in Table 8.

Figures 5a, 5b, Summarized

(Combined, S-a-1 and S-a-0 gate-level faults)

o 53.3% undetected after 4 repetitions.
o 43.2% detected in 1st repetition.
o 3.5% detected in repetitions 2, 3 and 4.
o 49.3% of S-a-1 faults undetected.
o 57.4% of S-a-0 faults undetected.

For comparison purposes it is desirable to extrapolate the results of QUAD
to 8 repetitions. To obtain a rough extrapolation we note that the average
proportion of detected faults in repetitions 2 through 8 in the FETSTO, ADDSUB
and FIB experiments is 7.4%. Using this estimate we obtain:

o 49.4% undetected after 8 repetitions.

Figures 5c, 5d, 5e, Summarized

(Combined gate-level faults by partition)

o 97.1% of faults in Partition #5 undetected after 4 repetitions.
o 100% of faults in Partition #6 undetected after 4 repetitions.

Figures 5f, 5g, Summarized

(Combined, S-a-1 and S-a-0 component-level faults)

o 23.5% undetected after 4 repetitions (compared with 53.3% for gate-level
  faults).
o 71.8% detected in 1st repetition.
o 4.7% detected in repetitions 2, 3 and 4.
o 20.2% of S-a-1 faults undetected.
o 26.9% of S-a-0 faults undetected.

Again, to extrapolate the results of QUAD to 8 repetitions we note that the
average proportion of detected component-level faults in repetitions 2 through 8
in the FETSTO, ADDSUB and FIB experiments is 11.5%. Using this estimate we
obtain:

o 16.7% undetected after 8 repetitions.
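
The rough extrapolation used here is simple arithmetic: the share detected in
the 1st repetition is kept, and the share detected in repetitions 2 through 8
is replaced by the average observed in FETSTO, ADDSUB and FIB. A short sketch
of that calculation for QUAD (the function name is illustrative):

```python
def extrapolate_undetected(first_rep_detected_pct, later_detected_pct):
    """Percent undetected after 8 repetitions under the rough extrapolation."""
    return 100.0 - first_rep_detected_pct - later_detected_pct


# Component-level shares detected in repetitions 2 through 8 (FETSTO, ADDSUB, FIB).
component_later = [13.2, 10.7, 10.7]
avg_component = sum(component_later) / len(component_later)

quad_gate = extrapolate_undetected(43.2, 7.4)        # gate-level, 7.4% average
quad_component = extrapolate_undetected(71.8, 11.5)  # component-level, 11.5% average

print(round(avg_component, 1), round(quad_gate, 1), round(quad_component, 1))
# prints: 11.5 49.4 16.7
```

For the single-repetition experiments the same function applies with the 1st
repetition share of that experiment.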

Figures 5h, 5i, Summarized

(Combined component-level faults by partition) After 4 repetitions:

o 28.6% undetected in Partition #1.
o 10.2% undetected in Partition #2.
o 43.6% undetected in Partition #3.
o 18.5% undetected in Partition #4.

5.2.5 SERCOM Experiment 

After each injected fault SERCOM was executed for a single repetition. The
resultant histograms of detected faults versus repetitions to detection are
shown in Figures 6a through 6i. Tabular results of the raw data are given
in Table 9.

Figures 6a, 6b, Summarized

(Combined, S-a-1 and S-a-0 gate-level faults)

o 60.5% undetected after a single repetition.
o 39.5% detected in the 1st repetition.
o 57.3% of S-a-1 faults undetected.
o 63.6% of S-a-0 faults undetected.

As in QUAD, extrapolating to 8 repetitions, we obtain:

o 53.1% undetected after 8 repetitions.

Figures 6c, 6d, 6e, Summarized

(Combined gate-level faults by partition)

o 98% of faults in Partition #5 undetected.
o 100% of faults in Partition #6 undetected.

Figures 6f, 6g, Summarized

(Combined, S-a-1 and S-a-0 component-level faults)

o 35.2% undetected after a single repetition (compared with 60.5% for
  gate-level faults).
o 64.8% detected in 1st repetition.
o 27.6% of S-a-1 faults undetected.
o 43.1% of S-a-0 faults undetected.

Extrapolating the results to 8 repetitions we obtain:

o 23.8% undetected after 8 repetitions.

Figures 6h, 6i, Summarized

(Combined component-level faults by partition) After a single repetition:

o 24.4% undetected in Partition #1.
o 33.6% undetected in Partition #2.
o 46.8% undetected in Partition #3.
o 35.9% undetected in Partition #4.


5.2.6 LINCON Experiment 

After each injected fault LINCON was executed for a single repetition. The
resultant histograms of detected faults versus repetitions to detection are
shown in Figures 7a through 7i. Tabular results of the raw data are given
in Table 10.

Figures 7a, 7b, Summarized

(Combined, S-a-1 and S-a-0 gate-level faults)

o 48.3% undetected after a single repetition.
o 51.7% detected in the 1st repetition.
o 46.7% of S-a-1 faults undetected.
o 50% of S-a-0 faults undetected.

As in SERCOM, extrapolating to 8 repetitions, we obtain:

o 40.9% undetected after 8 repetitions.

Figures 7c, 7d, 7e, Summarized

(Combined gate-level faults by partition)

o 96.2% of faults in Partition #5 undetected.
o 100% of faults in Partition #6 undetected.

Figures 7f, 7g, Summarized

(Combined, S-a-1 and S-a-0 component-level faults)

o 23.5% undetected after a single repetition (compared with 48.3% for
  gate-level faults).
o 76.5% detected in 1st repetition.
o 21.2% of S-a-1 faults undetected.
o 25.9% of S-a-0 faults undetected.

Extrapolating the results to 8 repetitions we obtain:

o 12% undetected after 8 repetitions.

Figures 7h, 7i, Summarized

(Combined component-level faults by partition)

o 20.8% undetected in Partition #1.
o 10.9% undetected in Partition #2.
o 41.5% undetected in Partition #3.
o 26.1% undetected in Partition #4.

The Phase I results are concisely summarized in Table 11. 

5.3 Phase II Experiments

5.3.1 Indistinguishable Fault Estimates

As indicated in Section 5.1, 300 gate-level faults and 200 component-level
faults were injected in the Phase II experiments. In order to obtain an
estimate of the proportion of indistinguishable faults, each resultant
undetected fault was analyzed and those faults which were obviously
indistinguishable were disqualified. At the gate-level, 71 out of 300 faults
were identified as indistinguishable. Thus, the estimated proportions of
indistinguishable faults are:

Y* = 71/300 = 0.2366 at the gate-level

and Y* = 11/200 = .055 at the component-level


Since indistinguishable faults were not disqualified in the Phase I
experiments, all coverage estimates of Phase I should be divided by the
appropriate 1 - Y* factor, as prescribed in Section 10.
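
As a worked illustration of this correction, the sketch below computes Y* from
the disqualified counts and divides a raw Phase I coverage by 1 - Y*. The
component-level count of 11 is taken as 200 - 189 from Section 5.3.2, the raw
coverage used is the FETSTO combined gate-level figure (100% - 61.7%), and the
function names are illustrative. Only the stated "divide by 1 - Y*" rule is
applied; the full prescription of Section 10 is not reproduced here.

```python
def indistinguishable_fraction(n_indistinguishable, n_injected):
    """Estimated proportion Y* of indistinguishable faults."""
    return n_indistinguishable / n_injected


def corrected_coverage(raw_coverage, y_star):
    """Phase I coverage with indistinguishable faults factored out."""
    return raw_coverage / (1.0 - y_star)


y_gate = indistinguishable_fraction(71, 300)       # about 0.2367
y_component = indistinguishable_fraction(11, 200)  # 0.055

raw_fetsto_gate = 1.0 - 0.617                      # 38.3% detected in 8 repetitions
fetsto_gate = corrected_coverage(raw_fetsto_gate, y_gate)  # roughly 0.50
```

Under these assumptions the FETSTO gate-level coverage rises from about 38% to
about 50% once indistinguishable faults are excluded.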

5.3.2 Self-Test Coverage 

Having disqualified 71 indistinguishable faults, 229 faults were effectively
injected at the gate-level and 189 at the component-level. The resultant raw
data is given in Table 12 by partitions.

As indicated previously, after each injected fault the self-test program
was executed. Faults were generally detected either because an explicit test
detected the fault or because the fault caused a jump out of the program. These
latter faults are denoted in Table 12 by "wild branches".

5.3.3 Gate-Level Faults

From Table 12 we observe:

o 198 out of 229 combined faults were detected for a coverage of 86.46%.
o 100 out of 114 S-a-1 faults were detected for a coverage of 87.72%.
o 98 out of 115 S-a-0 faults were detected for a coverage of 85.22%.
o 9 out of 17 faults in Partition #5 were detected for a coverage of 52.94%.
o 5 out of 8 faults in Partition #6 were detected for a coverage of 62.5%.
o If faults in Partitions #5 and #6 are disqualified then 184 out of 204
  faults were detected for a coverage of 90.2%.
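
These coverage figures follow directly from the per-partition counts. A brief
sketch, using the gate-level totals per partition as read from Table 12 (the
dictionary layout and function name are illustrative):

```python
# Gate-level faults per partition: detected and effectively injected.
detected = {1: 28, 2: 58, 3: 39, 4: 59, 5: 9, 6: 5}
injected = {1: 32, 2: 65, 3: 39, 4: 68, 5: 17, 6: 8}


def coverage_pct(partitions):
    """Percent of injected faults detected over the given partitions."""
    det = sum(detected[p] for p in partitions)
    inj = sum(injected[p] for p in partitions)
    return 100.0 * det / inj


all_partitions = coverage_pct([1, 2, 3, 4, 5, 6])  # 198 of 229
without_5_and_6 = coverage_pct([1, 2, 3, 4])       # 184 of 204
```

Excluding Partitions #5 and #6 removes 25 injected faults of which only 14
were detected, which is why the coverage rises.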

As an indication of fault latency the test that actually caused detection
of the fault was recorded. It is recalled that arithmetic register A_14 is
incremented after the successful completion of each of the 241 tests that
comprise Self-Test. If a test is unsuccessful or if all tests are successful
the contents of A_14 are transferred to a designated memory location, ANSW. The
fault is considered detected if, after a complete execution of Self-Test by the
non-failed processor, the contents of ANSW ≠ 241.

Occasionally a fault can result in an incorrect incrementation of A_14 or
prevent the transfer to ANSW. In the former case the test cannot be identified
correctly. In the latter case the contents of ANSW remain at the initial value
of zero and, as a consequence, do not indicate the correct test number either.

The procedure used to identify the test was to set:

(Test #) - 1 = (ANSW) when (ANSW) ≠ 0.

(Test #) - 1 = (A_14) when (ANSW) = 0.

This results in the correct identification of the test, in most cases.

Table 13 gives the number of gate-level and component-level faults
detected by each test. When (Test #) - 1 = 0 the effect is referred to, in
Table 12, as a "wild branch" since, in most cases, the fault caused a jump out
of the program.

From Table 13 we observe:

o 103 out of the 198 faults detected resulted in wild branches, i.e., 52%.
o 95 faults were detected by an explicit test (even though it was not
  always possible to identify the test).
o Out of the 241 possible tests, at most 46 actually resulted in a
  detection, i.e., most of the tests were, effectively, redundant.

5.3.4 Component-Level Faults

From Tables 12 and 13 we observe:

o 185 out of 189 combined faults were detected for a coverage of 97.9%.
o 97 out of 100 S-a-1 faults were detected for a coverage of 97%.
o 88 out of 89 S-a-0 faults were detected for a coverage of 98.9%.
o 106 of the detected faults resulted in wild branches, i.e., 56% of the
  189 faults injected.
o 79 faults were detected by an explicit test (even though it was not
  always possible to identify the test).
o Out of 241 possible tests, at most 44 actually resulted in a detection.

Table 14 shows the self-test coverage at the gate and component-levels by
partitions.


5.4 URN Model Parameters 

From the Phase I experiments the parameters of the Urn Model were estimated 
for the three programs 


FETSTO 

ADDSUB 

FIB 

for combined, S-a-1 and S-a-0 gate-level and component-level faults. 

Table 15 

This table gives the exact and approximate maximum likelihood estimates of
a, P and P0, as defined in Section 10. Also shown are the resultant, computed
Urn Model distributions in terms of the occupancy probabilities of cells 1,
2, ..., 8. These correspond to the probabilities X_i, Y_i or Z_i for S-a-0,
S-a-1 and combined faults, respectively. In keeping with our subsequent
notation, the occupancy probability of cell 9 is actually the probability that
the fault is undetected in the previous 8 repetitions. As a comparison, the
corresponding empirical distributions are also given. These were obtained
directly from the latency distributions of Section 5.2.

Referring to the table, we note that the approximate estimates are accurate
to two decimal places, in most cases.
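
The estimator formulas themselves are defined in Section 10 and are not
reproduced here. The sketch below is one plausible reading that happens to
reproduce the approximate (N) row of Table 15 for the combined gate-level
FETSTO data: P0 as the fraction detected at all, P as the fraction of detected
faults found in the 1st repetition, and a as a geometric-type estimate over
repetitions 2 through 8. The cell counts are the combined totals read from
Table 5, with the two illegible repetition-4 cells assumed to be 0 so that the
detected total is 383; both the counts and the formulas should be treated as
assumptions.

```python
# Combined gate-level FETSTO detections in latency cells 1..8 (assumed reading).
counts = [299, 48, 21, 0, 0, 6, 2, 7]
m = 1000                                   # faults injected

detected = sum(counts)
p0 = detected / m                          # fraction of faults ever detected
p = counts[0] / detected                   # detected faults caught in repetition 1

later = counts[1:]                         # cells 2..8
# Cell 2 counts as one trial after the 1st repetition, cell 3 as two, and so on.
trials = sum((k + 1) * c for k, c in enumerate(later))
a = sum(later) / trials                    # geometric-type estimate

print(round(p0, 3), round(p, 4), round(a, 4))
# prints: 0.383 0.7807 0.4641
```

These values agree with the N row for FETSTO/GATE COMBINED; the exact (T)
estimates differ slightly, presumably because they account for the truncation
at 8 repetitions.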
Figures 8, 9, 10

The resultant, computed Urn Model distributions are shown graphically in
these figures using, in all cases, the exact estimators. The distributions
are superimposed on the corresponding empirical distributions of Section 5.2.

Table 16

This table gives the elements of the error covariance matrix for the Urn
Model estimates. The matrix was obtained using the exact maximum likelihood
estimates.

Table 17 

This table gives the elements of the inverse error covariance matrix of 
Table 16. 

Table 18 


For completeness, the intermediate parameters that were used to obtain the
estimates of a, P and P0 are given in this table. The intermediate parameters
are m_1, m_2, m_9, A, B, C, D as defined in Section 10. We note that the
symbols m_i referred to S-a-0 faults previously but, in the context of
Table 18, they refer to S-a-0, S-a-1 or combined faults, as the case may be.

5.5 Accuracy and Confidence of Results 
5.5.1 Phase I Results 

The accuracy of the Phase I results will be illustrated by the combined,
gate-level FETSTO experiments. Using the marginal distributions for latency
cells #1 and #9 and using (24) of Section 10.6 gives, for the errors at the 95%
confidence level,

e_1 = 1.96 sqrt(X_1 (1 - X_1)/m) = .028 (9.36%)

where X_1 = .299
      X_9 = .617
      m = 1000.

When the multivariate distribution is used, i.e., (23) of Section 10.6, the
corresponding errors are

e_1' = .042 (14%)

^9 ’ V tssj “ (9.9?) 
5.5.2 Phase II Results


The accuracy of the Phase II results can be estimated using (13) of
Section 10.5. We illustrate using the combined gate and component-level faults.

At the gate-level the estimated coverage is

z = .8646

with a sample size of 229, the indistinguishable faults having been
disqualified. At the 95% confidence level the resultant error is

e = 1.96 sqrt(.8646 (1 - .8646)/229) = .044 (5.1%), approximately.


At the component-level the estimated coverage is

z = .979

with a sample size of 189. At the 95% confidence level the resultant error
is

e = 1.96 sqrt(.979 (1 - .979)/189) = .020 (2.0%), approximately.
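
The quoted errors can be checked numerically. The sketch below assumes that
(13) of Section 10.5 is the usual normal-approximation error for a binomial
proportion, 1.96 times the square root of z(1 - z)/n; that assumption
reproduces the .044 and .020 quoted above. The function name is illustrative.

```python
import math


def error_95(z, n):
    """Half-width of the 95% interval for a proportion z estimated from n faults."""
    return 1.96 * math.sqrt(z * (1.0 - z) / n)


gate_error = error_95(0.8646, 229)       # gate-level self-test coverage
component_error = error_95(0.979, 189)   # component-level self-test coverage

print(round(gate_error, 3), round(component_error, 3))
# prints: 0.044 0.02
```

Relative to the coverage itself the gate-level error is about 5.1%, as quoted.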

5.5.3 Urn Model Results 

The accuracy of the Urn Model results will be illustrated by the combined,
gate-level FETSTO experiment. In this experiment the parameters were
estimated to be

P = .781
P0 = .383
a = .464

using the approximate estimators (28) of Section 10.7.

From (29) the errors at the 95% level are, approximately,

e_P = .041 (5.2%)

e_P0 = .030 (7.8%)

"a • 1-56 /5i?5P!W ■ 
TABLE 1


FAILURE RATES OF PARTITIONS OF THE CPU 


PARTITION 

#1 

#2 

#3 

#4 

#5 

#6 



7.063 
1 .188 
45.023 





TABLE 2


NUMBER OF FAULTS INJECTED


EXPERIMENT     GATE-LEVEL     COMPONENT-LEVEL

FETSTO            1000              400
ADDSUB             600              400
FIB                600              400
QUAD               600              400
SERCOM            1000              400
LINCON             600              400
SELF-TEST          300              200
TABLE 3a


PHASE I EXPERIMENTS

NUMBER OF GATE-LEVEL FAULTS INJECTED BY PARTITIONS


PROGRAMS: FETSTO, SERCOM

PARTITION     S-a-0          S-a-1          COMBINED

1              82 (81)        76 (80)        158 (162)
2             147 (133)      127 (132)       274 (265)
3              87 (89)        98 (87)        185 (176)
4              98 (108)      108 (107)       206 (214)
5              75 (74)        75 (78)        150 (157)
6              14 (13)        13 (13)         27 (26)
TOTAL         503 (503)      497 (497)      1000 (1000)


PROGRAMS: FIB, ADDSUB, QUAD, LINCON

PARTITION     S-a-0          S-a-1          COMBINED

1              49 (48)        45 (49)         94 (97)
2              90 (78)        78 (80)        168 (159)
3              46 (52)        66 (53)        112 (106)
4              53 (63)        57 (65)        110 (129)
5              52 (46)        52 (48)        104 (94)
6               6 (8)          6 (8)          12 (16)
TOTAL         296 (296)      304 (303)       600 (601)


( ) = Theoretical



TABLE 3b


PHASE I EXPERIMENTS

NUMBER OF COMPONENT-LEVEL FAULTS INJECTED BY PARTITIONS


PROGRAMS: FETSTO, FIB, ADDSUB, SERCOM, QUAD, LINCON

PARTITION     S-a-0          S-a-1          COMBINED

1              38 (39)        39 (40)         77 (79)
2              58 (64)        79 (66)        137 (130)
3              52 (42)        42 (44)         94 (86)
4              49 (52)        43 (53)         92 (105)
TOTAL         197 (197)      203 (203)       400 (400)


( ) = Theoretical



TABLE 4a


PHASE II EXPERIMENTS

NUMBER OF GATE-LEVEL FAULTS INJECTED BY PARTITIONS


PARTITION     S-a-0          S-a-1          COMBINED

1              17 (24)        17 (24)         34 (48)
2              35 (40)        39 (40)         74 (80)
3              28 (27)        27 (26)         55 (53)
4              40 (32)        34 (32)         74 (64)
5              25 (24)        25 (23)         50 (47)
6               7 (4)          6 (4)          13 (8)
TOTAL         152 (151)      148 (149)       300 (300)


TABLE 4b

PHASE II EXPERIMENTS

NUMBER OF COMPONENT-LEVEL FAULTS INJECTED BY PARTITIONS


PARTITION     S-a-0          S-a-1          COMBINED

1              15 (18)        20 (20)         35 (37)
2              38 (29)        35 (32)         73 (61)
3              21 (19)        22 (22)         43 (41)
4              15 (23)        23 (26)         38 (50)
TOTAL          89 (89)       100 (100)       189 (189)


( ) = Theoretical
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FAULTS 
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'*3 
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H 
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m 

m 
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BI 

PI 

PI 

m 

n 

PI 

50 

37 

■ 

4 

2 

1 

0 

0 

0 

0 

■ 

0 

0 


0 

0 

82 

76 

P2 

46 

72 

■ 

10 

11 

0 

0 

B 

0 

0 

H 

2 

0 

H 

0 

3 

147 

127 

P3 

24 

34 

8 

3 

0 

0 

0 

B 

0 

0 

B 

0 

0 

B 

2 

2 

87 

98 

P4 

16 

19 

1 

6 

6 

1 

0 

B 

0 

0 

0 

0 

B 

1 

0 

0 

98 

108 

P5 

0 

1 

0 

2 

0 

0 

0 


0 

0 

0 

0 

B 

0 

0 

0 

75 

75 

P6 


0 

0 

1 

0 


0 

0 

0 


0 

0 

0 

0 

0 


14 

wm 

mmm 

136 

163 

22 

26 

19 

2 

m 

m 

0 

0 

4 

2 

0 

2 

2 

5 

503 

497 


in^ = detected S-a-0 faults, 1th cell 


= detected S-a-1 faults, Uh cell 


■pi 


COMPONENT-LEVEL FAULTS 


PARTITION 
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■zmsaii 

■oiisiSi 

■-‘■" 1 . ■■ 


nip 

"2 

m 3 

"3 

"'4 

"4 


"5 


% 

*"7 

"7 


"8 

in 

n 

PI 

22 

27 

2 

■ 

3 

1 

0 

0 

m 

0 

0 

0 

0 

0 

B 

0 

38 

39 

P2 

36 

49 

B 

B 

2 

2 

1 

0 

B 

0 

0 

0 

5 

0 

0 

2 

58 

79 

P3 

18 

19 

1 

1 

0 

0 

0 

0 


m 

0 

0 

0 

0 


0 

52 

42 

P4 
P5 
. P6 

16 

18 

1 

1 

2 

0 

0 

1 

1 

0 

0 

0 

0 

1 

1 

1 

49 

43 

TOTAL 

92 

113 

12 

22 

7 

3 

1 

0 

0 

0 

0 

0 

5 

0 

1 

2 

197 

203 
[Figures 2a through 2i: FETSTO histograms of detected faults versus
repetitions to detection. Surviving panel title: COMBINED GATE-LEVEL FAULTS IN
PARTITION #1.]
ADDSUB LATENCY DATA


GATE-LEVEL FAULTS 


PARTITION 

_ . __ . DETECTED FAUL1 

rs . 

n 

n 

"*2 

"2 


"3 

■"4 

"4 


"5 

PI 

26 

27 

0 

0 

4 

0 

0 

0 

0 

0 

P2 

32 

46 

7 

o 

0 

3 

0 

0 

■ 

0 

P3 

17 

27 

0 

■ 

2 

2 

0 

0 

H 

0 

P4 

9 

17 

0 

■ 

4 

2 

0 

2 

■ 

i 

P5 

0 

0 

0 

0 

0 

1 

0 

0 

H 

0 

P6 

0 

0 

0 

0 

0 

0 

0 

0 


0 

...TOTAL 

84 

117 

7 

9 

10 

8 

0 

2 

2 

1 


= detected S-a-0 faults, ith cell n-j = detected S-a- 


cn 

CO 


COMPONENT-LEVEL FAULTS 


PARTITION 

DETECTEI 

FAUn 

[S 

‘"1 

"l 


"2 

m3 

"3 

"•4 



"6 

PI 

22 

23 

B 

0 

2 

■ 

0 

0 

0 

0 

P2 

35 

51 


10 

1 

B 

0 

0 

0 

0 

P3 

19 

23 


0 

3 


0 

0 

0 

0 

P4 

24 

31 


D 

1 


0 

0 

0 

0 

P5 




B 







P6 



■ 

■ 
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TOTAL 

100 

128 

16 

11 

7 

7 

0 

0 

0 
[Figures 3a through 3i: ADDSUB histograms of detected faults. Horizontal axis:
TIME TO DETECT (REPETITIONS), 1 through 10.]


FIB LATENCY DATA


GATE-LEVEL FAULTS


PARTITION 

DETECTEO FAULl 

s _ 

FAULTS 
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■"l 

"l 


"2 


"3 

™4 

"4 

b 

"5 

m , 
0 

"6 

m? 

"7 

"*8 

"a 

m 

n 

PI 

30 

29 

D 

1 

0 

0 

0 

0 

0 

0 

0 

B 

0 

B 

0 

0 

49 

45 

P2 

33 

48 

D 


1 

2 

0 

0 

0 

1 

0 

0 

0 

0 

B 

0 

90 

78 

P3 

17 

27 

B 

D 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

B 

0 

46 

66 

P4 

10 

15 

3 

B 

0 

1 

B 

0 

0 

D 

0 

1 

0 

0 

1 

D 

53 

57 

P5 

0 

1 

0 

1 

0 

0 

B 

0 

0 

0 

0 

0 

0 

0 

B 

0 

52 

52 

P6 

0 

0 

0 

0 

0 

0 

H 

0 

0 

0 

0 

0 

0 

0 

B 

0 

6 

6 
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90 
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19 

14 

B 

3 

B 

0 

0 

B 
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0 

1 

0 
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P5 
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1 

49 
[Figures 7a through 7i: LINCON histograms of detected faults. Surviving panel
titles: COMBINED GATE-LEVEL FAULTS; COMBINED COMPONENT-LEVEL FAULTS IN
PARTITION #1.]


TABLE 11


SUMMARY OF PHASE I RESULTS


GATE-LEVEL FAULTS

                       S-a-0 Faults                          S-a-1 Faults

                     DETECTED  PERCENT   PERCENT            DETECTED  PERCENT   PERCENT
PROGRAM   INJECTED   1st REP.  DETECTED  UNDETECTED  INJECTED  1st REP.  DETECTED  UNDETECTED
                               1st REP.                                  1st REP.

FETSTO       503       136      27.0      63.6        497       163      32.8      59.8
ADDSUB       296        84      28.4      64.2        304       117      38.5      54.9
FIB          296        90      30.4      62.2        304       120      39.5      54.3
SERCOM       503       183      36.4      63.6        497       212      42.7      57.3
QUAD         296       114      38.5      57.4        304       145      47.8      49.3
LINCON       296       148      50.0      50.0        304       162      53.3      46.7


COMPONENT-LEVEL FAULTS

                       S-a-0 Faults                          S-a-1 Faults

                     DETECTED  PERCENT   PERCENT            DETECTED  PERCENT   PERCENT
PROGRAM   INJECTED   1st REP.  DETECTED  UNDETECTED  INJECTED  1st REP.  DETECTED  UNDETECTED
                               1st REP.                                  1st REP.

FETSTO       197        92      46.7      40.1        203       113      55.7      31.0
ADDSUB       197       100      50.8      37.1        203       128      63.1      27.6
FIB          197       111      56.3      33.5        203       134      66.0      22.7
SERCOM       197       112      56.9      43.1        203       147      72.4      27.6
QUAD         197       133      67.5      26.9        203       154      75.9      20.2
LINCON       197       146      74.1      25.9        203       160      78.8      21.2
TABLE 12


GATE-LEVEL FAULTS (86.5% DETECTION)

              DETECTED      FAULTS        DETECTED FAULTS
              FAULTS        INJECTED      TEST #    WILD      TOTAL      TOTAL
PARTITION     m      n      m      n      KNOWN     BRANCH    DETECTED   INJECTED

P1            14     14     16     16       1        27         28         32
P2            26     32     30     35      25        33         58         65
P3            20     19     20     19      17        22         39         39
P4            33     26     37     31      43        16         59         68
P5             2      7      8      9       7         2          9         17
P6             3      2      4      4       2         3          5          8
TOTAL         98    100    115    114      95       103        198        229*


m = S-a-0 faults
n = S-a-1 faults

* 71 faults were disqualified as indistinguishable.


COMPONENT-LEVEL FAULTS (97.9% DETECTION)

              DETECTED      FAULTS        DETECTED FAULTS
              FAULTS        INJECTED      TEST #    WILD      TOTAL      TOTAL
PARTITION     m      n      m      n      KNOWN     BRANCH    DETECTED   INJECTED

P1            15     19     15     20       5        29         34         35
P2            38     34     38     35      35        37         72         73
P3            20     21     21     22      16        25         41         43
P4            15     23     15     23      23        15         38         38
P5
P6
TOTAL         88     97     89    100      79       106        185        189*


* 11 faults were disqualified as indistinguishable.
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TABLE 13 


FAULT DETECTION BY THE INDIVIDUAL TESTS 


COMPONENT* LEVEL GATE-LEVEL 


TEST # - 1 

FREa 

TEST # - 1 

FREa 

0 

106 

0 

103 

1 

2 

1 

1 

2 

6 

2 

1 

3 

2 

3 

2 

5 

7 

4 

1 

6 

4 

5 

4 

7 

4 

6 

2 

8 

5 

7 

4 

9 

1 

8 

9 

10 

2 

10 

3 

11 

1 

12 

1 

18 

4 

16 

1 

19 

2 

18 

9 

21 

1 

19 

4 

22 

2 

21 

2 

23 

1 

22 

IS 

28 

2 

24 

1 

35 

3 

25 

1 

37 

1 

28 

1 

38 

1 

33 

1 

44 

2 

34 

1 

49 

1 

35 

3 

69 

2 

36 

1 

95 

1 

38 

1 

102 

1 

39 

1 

108 

1 

43 

3 

no 

1 

53 

1 

144 

1 

92 

1 

187 

2 

98 

1 

241 

15^^^ (UNDETECTED) 

100 

1 


— 

102 

1 


184 

103 

1 



112 

1 

OUT OF RANGE 


114 

2 

TEST » 

16 

177 

1 


— — 

236 

1 


200 

239 

1 



241 

102^^^ (UNDETECTED) 



OUT OF RANGE 




TEST # 

10 


300 


(1) 11 were subsequently disqualified as indistinguishable
(2) 71 were subsequently disqualified as indistinguishable

m 




GATE-LEVEL FAULTS 
































T = Exact estimate
N = Approximate estimate
A = Empirical value

TEST          FAULTS     EST   (1-a)    (a)      p        p0

FETSTO/GATE   COMBINED   T     0.5614   0.4386   0.7776   0.3845
                         N     0.5359   0.4641   0.7807   0.383
                         A
              S-a-0      T     0.5168   0.4832   0.7413   0.3647
                         N     0.5      0.5      0.7432   0.3638
                         A
              S-a-1      T     0.6137   0.3863   0.8099   0.4049
                         N     0.575    0.425    0.815    0.402
                         A

FETSTO/COMP   COMBINED   T     0.5280   0.4720   0.7927   0.6465
                         N     0.5093   0.4907   0.7946   0.6450
                         A
              S-a-0      T     0.6603   0.3397   0.7698   0.6066
                         N     0.6061   0.3939   0.7797   0.5990
                         A
              S-a-1      T     0.3594   0.6406   0.8070   0.6898
                         N     0.3571   0.6429   0.8071   0.6897
                         A

FIB/GATE      COMBINED   T     0.3177   0.6823   0.8366   0.4184
                         N     0.3167   0.6833   0.8367   0.4183
                         A
              S-a-0      T     0.2909   0.7091   0.8035   0.3784
                         N     0.2903   0.7097   0.8036   0.3784
                         A
              S-a-1      T     0.3466   0.6534   0.8632   0.4573
                         N     0.3448   0.6552   0.8633   0.4572
                         A

FIB/COMP      COMBINED   T     0.2958   0.7042   0.8507   0.7200
                         N     0.2951   0.7049   0.8507   0.7200
                         A
              S-a-0      T     0        1.000    0.8473   0.6650
                         N     0        1.000    0.8473   0.6650
                         A
              S-a-1      T     0.4468   0.5532   0.8531   0.7733
                         N     0.4390   0.5610   0.8535   0.7734
                         A

ADDSUB/GATE   COMBINED   T     0.5309   0.4691   0.8254   0.4058
                         N     0.5116   0.4884   0.8272   0.4050
                         A
              S-a-0      T     0.6051   0.3949   0.7874   0.3604
                         N     0.5686   0.4314   0.7925   0.3581
                         A
              S-a-1      T     0.4353   0.5647   0.8536   0.4509
                         N     0.4286   0.5714   0.8540   0.4507
                         A

ADDSUB/COMP   COMBINED   T     0.3401   0.6599   0.8413   0.6776
                         N     0.3385   0.6615   0.8413   0.6775
                         A
              S-a-0      T     0.3153   0.6847   0.8064   0.6295
                         N     0.3143   0.6857   0.8065   0.6294
                         A
              S-a-1      T     0.3693   0.6307   0.8706   0.7242
                         N     0.3667   0.6333   0.8707   0.7241
                         A
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'occupancy probabilities 


Cell * 

Cell * 

CeTT * 

Cell * 

Cell - 

Call * 

■“ Ci 11 *'“ 

Cell * 

Cell * 

#1 

#2 

#3 

#4 

#5 

#6 

#7 

^3 

49 

0.299 

0.038 

0.021 

0.012 

0.007 

0.004 

0.002 

0.001 

0.617 

0.299 

0.039 

0.021 

0.011 

0.006 

0.003 

0.002 

0.001 

0.617 

0.299 

0.048 

0.021 

0 

0 

0.006 

0.002 

0.007 

0.617 

0.270 

0.046 

0.024 

0.012 

0.006 

0.003 

0.002 

0.001 

0.6362 

0.270 

0.047 

0.023 

0.012 

0.006 

0.003 

0.001 

0.001 

0.6362 

0.270 

0.044 

0.038 

0 

0 

0.008 

0 

0.004 

0.636 

0.328 

0.030 

0.018 

0.011 

0.007 

0.004 

0.003 

0.002 

0.5976 

0.323 

0.032 

0.018 

0.010 

0.006 

0.003 

0.002 

0.001 

0.598 

0.328 

0.052 

0.004 

0 

0 

0.004 

0.004 

0.010 

0.598 

0.5125 

0.063 

0.033 

0.018 

0.009 

0.005 

0.003 

0.001 

0.3550 

0.5125 

0.065 

0.033 

0.017 

0.009 

0.004 

0.002 

0.001 

0.355 

0.513 

0.085 

0.025 

0.003 

0 

0 

0.013 

0.008 

0.355 

0.467 

0.047 

0.031 

0.021 

0.014 

0.009 

0.006 

0.004 

0.4010 

0.467 

0.052 

0.032 

0.019 

0.012 

0.007 

0.004 

0.003 

0.4010 

0.467 

0.061 

0.036 

0.005 

0 

0 

0.025 

0,005 

0.401 

0.557 

0.085 

0.031 

0.011 

0.004 

0.001 

0.001 

0.000 

0.3103 

0.557 

0.086 

0.031 

0.011 

0.004 

0.001 

0.001 

0.000 

0.3103 

0.557 

0.108 

0.015 

0 

0 

0 

0 

. 0.010 

0.310 


0.350 

0.047 

0.015 

0.005 

0.001 

0 

0 

0 

0.5816 

0.350 

0.047 

0.015 

0.005 

0.001 

0 

0 

0 

0.5817 

0.350 

0.055 

0.007 

0.002 

0.002 

0.002 

0 

0.002 

0.582 

0.304 

0.053 

0.015 

0.004 

0.001 

0 

0 

0 

0.6216 

0.304 

0.053 

0.015 

0.004 

0.001 

0 

0 

0 

0.6216 

0.304 

0.064 

0.003 

0.003 

0 

0 

0 

0.003 

0.622 

0.395 

0.041 

0.014 

0.005 

0.002 

0.001 

0 

0 

0.5427 

0.395 

0.041 

0.014 

0.005 

0.002 

0.001 

0 

0 

0.5428 

0.395 

0.046 

0.010 

0 

0.003 

0.003 

0 

0 

0.543 

0.613 

0.076 

0.022 

0.007 

0.002 

0.001 

0 

0 

0.280 

0.613 

0.076 

0.022 

0.007 

0.002 

0.001 

0 

0 

0.280 

0.613 

0.090 

0.008 

0.003 

0.003 

0.003 

0 

0.003 

0.280 

0.5635 

0.1015 

0 

0 

0 

0 

0 

0 

0.335 

0.5635 

0.1015 

0 

0 

0 

0 

0 

0 

0.335 

0.563 

0.102 

0 

0 

0 

0 

0 

0 

0.335 

0.660 

0.063 

0.028 

0.013 

0.006 

0.003 

0.001 

0.001 

0.2266 

0.660 

0.064 

0.028 

0.012 

0.005 

0.002 

0.001 

0 

0.2266 

0.660 

0.079 

0.015 

0.005 

0.005 

0.005 

0 

0.005 

0.227 


0.335 

0.033 

0.018 

0.009 

0.005 

0.003 

0.001 

0.001 

0.5950 

0.335 

0.034 

0.018 

0.009 

0.005 

0.002 

0.001 

0.001 

0.5950 

0.335 

0.027 

0.030 

0.003 

0.005 

0.003 

0.002 

0 

0.595 

0.284 

0.030 

0.018 

0.011 

0.007 

0.004 

0.002 

0.001 

0.6419 

0.234 

0.032 

0.018 

0.010 

0.006 

0.003 

0.002 

0.001 

0.6419 

0.284 

0.024 

0.034 

0 

0.007 

0.007 

0.003 

0 

0.642 

0.385 

0.037 

0.016 

0.007 

0.003 

0.001 

0.001 

0 

0.5493 

0.385 

0.038 

0.016 

0.007 

0.003 

0.001 

0.001 

0 

0.5493 

0.335 

0.030 

0.026 

0.007 

0.003 

0 

0 

0 

0.549 

0.570 

0.071 

0.024 

0.008 

0.003 

0.001 

0 

0 

0.3225 

0.570 

0.071 

0.024 

0.008 

0.003 

0.001 

0 

0 

0.3225 

0.570 

0.068 

0.035 

0 

0 

0.005 

0 

0 

0.323 

0.508 

0.083 

0.026 

0.008 

0.003 

0.001 

0 

0 

0.3705 

0.508 

0.084 

0.026 

0.008 

0 . C 03 

0.001 

0 

0 

0.3706 

O.SOS 

0.081 

0.036 

0 

0 

0.005 

0 

0 

0.371 

0.631 

0.059 

0.022 

0.008 

0.003 

0.001 

0 

0 

0.2759 

0.631 

0.059 

0.022 

0.008 

0.003 

0.001 

0 

0 

0.2759 

0.631 

0.054 

0 . 034 - 

0 

r \ 

0.005 

0 

0 

0.276 
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FIGURE lOd 





ro 

00 


TEST 


”(PP) 

(aa) 

”{PoPo) 

'’(Pa) 

^(PPo) 

'’(aPo) 

FETSO/GATE 

COMBINED 

.45874E-3 

.18519E-2 

.23886E-3 

.71435E-4 

-.44562E-5 

-.35322C-4 


S-a-0 

.10553E-2 

.33012E-2 

.46305E-3 

.85297E-4 

-.48770E-5 

-.41971E-4 


S-a-1 

.80100E-3 

,42498E-2 

.49385E-3 

.25294E-3 

-.18026E-4 

-.12646E-3 

FETSTO/COHP 

COMBINED 

.64313E-3 

. 29281 E-2 

.57644E-3 

.73853E-4 

-.62756E-5 

-.60236E-4 


S-a-P 

.16216E-2 

.61628E-2 

.12976E-2 

.67009E-3 

-.1094BE-3 

-.52802E-3 


S-a-I 

.11130E-2 

.57I03E-2 

.10547E-2 

.13421E-4 

-.62136E-6 

-.11471E-4 

FIB/GATE 

COMBINED 

.54472E-3 

.36941 E-2 

.40560E-3 

.36381 E-5 

-.76297E-7 

-.18193E-5 


S-a-lJ 

.14096E-2 

.67498E-2 

.79468E-3 

.45248E-5 

-.95559E-7 

-.21307E-5 


S-a-1 

.84973E-3 

.8081 7E-2 

.81649E-3 

.11592E-4 

-.24245E-6 

-.61405E-5 

FIB/COMP 

COMBINED 

.44I14E-3 

.34677E-2 

.50403E-3 

.20647E-5 

-.63936E-7 

-.17476E-5 


S-a-0 








S-a-1 

.80075E-3 

.67568E-2 

.86450E-3 

.47317E-4 

-.25023E-5 

-. 42921 E-4 

AOUSUB/GATE 

COMBINED 

.59953E-3 

.36954E-2 

.40378E-3 

.84486E-4 

-.38398E-5 

-.41539E-4 


S-a-P 

.16319E-2 

.71298E-2 

.791 92E-3 

.42279E-3 

-.28793E-4 

-.19350E-3 


S-a-1 

.91413E-3 

.77717E-2 

.81514E-3 

.46365E-4 

-.13664E-5 

-.24488E-4 

ADDSUB/COMP 

COMBINED 

.49296E-3 

.35619E-2 

.54633E-3 

.51528E-5 

-.18170E-6 

-.41 501 E-5 


S-a-0 

.12593E-2 

.63014E-2 

.11841E-2 

.67686E-5 

-.25128E-6 

-.52836E-5 


S-a-1 

.76671E-3 

.81346E-2 

.98431 E-3 

.16280E-4 

-.54722E-6 

-.13542E-4 
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TEST 


^(PP) 


FETSTO/GATE 

COMBINED 

P193.2 

544.74 


S-a-P 

949. 6S 

303.89 


S-a-1 

1272.7 

241 .53 

FETSTO/COMP 

COMBINED 

1559.5 

343.23 


S-a-P 

646.59 

175.28 


S-a-1 

890.52 

175.13 

FIB/GATE 

COMBINED 

1835.8 

270.70 


S-a-p 

709.41 

148.15 


S-a-1 

1176.9 

123.74 

FIB/COMP 

COMBINED 

2266.8 

288.38 


S-a-p 

1012.7 

0.0 


S-a-1 

1249.4 

148.11 

ADOSUB/GATE 

COMBINED 

1673.4 

271.79 


S-a-p 

622.48 

143.34 


S-a-1 

1094.3 

128.72 

ADDSUB/COMP 

COMBINED 

2028.6 

280.75 


S-a-p 

794.12 

158.70 


S-a-1 

1304.3 

122.94 


jj(PoPo) 

„(Pa) 

^(PPo) 

„(aPo) 

4198.7 

-84.059 

20.486 

78.984 

2162.2 

-24.438 

7.7870 

’7.287 

2041.0 

-74.940 

27.266 

59.115 

1738. 6 

-39.068 

12.896 

35.441 

799.62 

-68.002 

26.880 

65.588 

948.18 

- 2.1108 

.50641 

1.9035 

2465.5 

- 1.8078 

.33723 

1.2139 

1258.4 

- .47553 

.084030 

.39718 

1224.8 

- 1.6878 

.33677 

.93009 

1984.0 

- 1.3496 

.28287 

.99971 

884.27 

0.0 

0.0 

0.0 

1157.1 

- 8.7288 

3.1829 

7.3280 

2479.6 

-38.124 

11.992 

27.598 

1271.5 

-36.540 

13.704 

33.694 

1226.9 

-6.5232 

1,6383 

3.8560 

1830.4 

-2.9338 

.65239 

2.1317 

844.50 

- .85287 

.16471 

.70792 

1016.0 

-2.6093 

.68922 

1.6900 


TABLE 17 INVERSE ERROR COVARIANCE MATRIX ELEMENTS FOR URN MODEL ESTIMATES



a 3 




m«!; m . 
t.f ^ 

"I 

Iin 

i 

i»i 


A 

8 

C 

0 

FETSO/GATE 

COMBINED 

1000 

299 

383 

564 

-407 

491 

-181 

97 


S-a-0 

503 

136 

183 

277 

-235 

282 

-94 

47 


S-a-1 

497 

163 

200 

287 

-172 

209 

-87 

50 

FETSO/COMP 

COMBINED 

400 

205 

258 

366 

-263 

316 

-108 

55 


S-a-0 

197 

92 

118 

184 

-116 

142 

-66 

40 


S-a-1 

203 

113 

140 

182 

-147 

174 

-42 

15 

FIS/GATE 

COMBINED 

600 

210 

251 

311 

-227 

268 

-60 

19 


S-a-0 

296 

90 

112 

143 

-123 

145 

-31 

9 


5-a-l 

304 

120 

139 

168 

-104 

123 

-29 

10 

FIB/COMP 

COMBINED 

400 

245 

288 

349 

-240 

233 

-61 

18 


S-a-0 

197 

111 

131 

151 

-120 

140 

-20 

0 


S-a-1 

203 

134 

157 

198 

-120 

143 

-41 

IS 

ADDSUB/GATE 

COMBINED 

600 

201 

243 

329 

-208 

250 

-86 

44 


S-a-0 

296 

84 

106 

157 

-103 

125 

-51 

29 


S-a-1 

304 

117 

137 

172 

-105 

125 

-35 

15 

ADOSUS/COMP 

COMBINED 

400 

228 

271 

336 

-236 

279 

-65 

22 


S-a-0 

197 

100 

124 

159 

-133 

157 

-35 

11 


S-a-1 

203 

123 

147 

177 

-103 

122 

-30 

11 


TABLE 18 INTERMEDIATE URN MODEL PARAMETER ESTIMATES
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6.0 SUMMARY OF EXPERIMENTS 


6.1 Phase I Experiments 

From the results of the previous section we observe that 

• Most detected faults are detected in the first repetition. Subsequent
repetitions do not appreciably increase the proportion of detected
faults.

• S-a-1 faults are easier to detect than S-a-0 faults. 

• The micromemory (i.e., Partition #5) contains a large proportion of
indistinguishable faults.

• Faults in memory units (i.e., Partitions #5, #6) are difficult to detect.

• A large proportion of faults remain undetected after as many as 8
repetitions.

• Component-level faults are easier to detect than gate-level faults. 

• The coverage estimates of the Phase I experiments are not corrected for 
indistinguishable fault content. 

Subsequent analysis of undetected faults indicates that the proportion of
indistinguishable faults is 23.66% at the gate-level and 5.5% at the
component-level. The combined, S-a-1 and S-a-0 coverage estimates should be
corrected by dividing the raw coverage by 1 - y*, where

1 - y* = 0.7633 for gate-level coverage

       = 0.945 for component-level coverage

As an example consider the raw coverage in the FETSTO experiment. The 
uncorrected data indicates 29.9% detection in the 1st repetition. The corrected 
coverage is, in fact, 39.17%. The 61.7% undetected is corrected to 49.84%. 
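
The correction can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, and the form used for the undetected proportion (removing the indistinguishable faults from both numerator and denominator) is an assumption, since the text states only the division by 1 - y*. Under that assumption the result is about 49.8%, close to the 49.84% quoted above.

```python
# Hypothetical helper names; y* is the indistinguishable fraction implied
# by the report's value 1 - y* = 0.7633 for gate-level faults.
GATE_Y = 1.0 - 0.7633


def corrected_detected(raw_detected: float, y: float) -> float:
    """Raw detected fraction divided by 1 - y*, as stated in the text."""
    return raw_detected / (1.0 - y)


def corrected_undetected(raw_undetected: float, y: float) -> float:
    """Assumed form: drop indistinguishable faults from both the
    undetected count and the injected count."""
    return (raw_undetected - y) / (1.0 - y)


if __name__ == "__main__":
    # FETSTO, gate-level, combined faults
    print(round(100 * corrected_detected(0.299, GATE_Y), 2))
    print(round(100 * corrected_undetected(0.617, GATE_Y), 2))
```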

The poor detection coverage of the six programs of Phase I is not surprising,
particularly if one considers that Self-Test, which exercises a much greater
mix and quantity of instructions, achieves 86.5% detection (at the gate-level).
Table 19 shows the instruction mix and quantity of instructions executed versus
coverage for each of the six programs. By contrast, Self-Test exercises almost
the entire instruction set of the CPU and executes approximately 2000
instructions in a single pass.

In the present study no attempt was made to evaluate the coverage capability
of each instruction of the Phase I experiments. As a consequence, it is
difficult to correlate instruction mix and coverage. However, the number of
executed instructions was plotted versus coverage for each of the programs for 
gate-level and component-level faults. These results are given in Figures 11 
and 12, respectively. The proportions of undetected faults in the QUAD, SERCOM 
and LINCON experiments were obtained by extrapolating to 8 repetitions, by the 
method described in Sections 5.2.4, 5.2.5 and 5.2.6. 
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Referring to the figures, the proportion of faults detected in the first repe- 
tition and the proportion of undetected faults after 8 repetitions are linear 
functions of the number of executed instructions, at least in the range of values
considered. It is unlikely, however, that this trend will continue for very 
large numbers of executed instructions. 
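
As a rough illustration of the linear trend, a least-squares line can be fitted to the gate-level values of Table 19 (number of executed instructions versus percent detected in the first repetition). This is a sketch, not the procedure used in the study; the function name and the choice of an ordinary least-squares fit are assumptions.

```python
# (executed instructions, % detected in 1st repetition), gate-level,
# combined faults, as listed in Table 19.
TABLE_19_GATE = [
    (6, 29.9),    # FETSTO
    (11, 33.5),   # ADDSUB
    (11, 35.0),   # FIB
    (59, 39.5),   # SERCOM
    (87, 43.2),   # QUAD
    (147, 51.7),  # LINCON
]


def fit_line(points):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


if __name__ == "__main__":
    slope, intercept = fit_line(TABLE_19_GATE)
    print(f"detected % ~ {intercept:.1f} + {slope:.3f} * instructions")
```

A straight line of this kind cannot hold indefinitely, since it would eventually exceed 100%, which is consistent with the caution above about very large numbers of executed instructions.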

The relatively high coverage of S-a-1 faults can be rationalized as fol- 
lows. The most significant bits of most arithmetic registers are normally zero. 
Thus, an S-a-1 fault in one of these bits will be detected whenever the contents
of the register are used, whereas an S-a-0 fault will only be detected when the
faulted bit is exercised to its complement.
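
The argument can be illustrated with a small simulation. The sketch below assumes a 16-bit register and operands below 256, so that the most significant bit is always zero; both figures are chosen for illustration and are not taken from the study.

```python
MSB = 15  # assumed 16-bit register, most significant bit


def apply_stuck_at(value: int, bit: int, stuck_value: int) -> int:
    """Return the register value as seen with one bit stuck at 0 or 1."""
    mask = 1 << bit
    return (value | mask) if stuck_value else (value & ~mask)


def is_exposed(value: int, bit: int, stuck_value: int) -> bool:
    """A fault is exposed when it changes the value that is read."""
    return apply_stuck_at(value, bit, stuck_value) != value


# Assumed operand range: small values whose upper bits are all zero.
small_values = range(256)
sa1_exposed = sum(is_exposed(v, MSB, 1) for v in small_values)
sa0_exposed = sum(is_exposed(v, MSB, 0) for v in small_values)

if __name__ == "__main__":
    print("S-a-1 exposed:", sa1_exposed, "of", len(small_values))
    print("S-a-0 exposed:", sa0_exposed, "of", len(small_values))
```

With these operands every use of the register exposes the S-a-1 fault, while the S-a-0 fault stays latent until a value with the top bit set is loaded.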

6.2 Phase II Experiments 

From the results of the Phase II (i.e., Self-Test) experiments we observe 

• There is a significant difference in coverage of gate-level versus 
component-level faults, e.g., after disqualifying indistinguishable 
faults gate-level fault coverage was 86.5% whereas component-level fault 
coverage was 97.9%. 

• There was a large proportion of indistinguishable faults in the gate-
level emulation, e.g., 23.7%. The worst offender was the micromemory
which yielded 33 indistinguishable faults out of a total of 41 selected.

• Only 48% of all detected faults were detected by an explicit test, i.e.,
95 out of 198. The other 103 faults were detected because the fault
resulted in a wild branch, i.e., a jump out of the first test.

• Most of the 241 tests comprising Self-Test were redundant; only 46 tests
resulted in a detection.

• Of the 95 faults detected by an explicit test, 59 were detected by the
first 23 tests.

• This particular Self-Test was designed to exercise an instruction set 
rather than explicit hardware. As noted in Section 7, this approach
results in an inefficient Self-Test since, it turned out, most of the 
tests exercised the same hardware. 

6.3 Urn Model Distributions

From previous studies and results of experiments we make the following 
observations regarding the Urn Model. 

• Despite its simplicity the Urn Model results in good correlation with
all of the empirical distributions of the study. This is not surprising
considering that the model has 3 degrees of freedom available for a best
fit, i.e., p, p0 and a, and the empirical distributions are heavily
weighted in the first, second and last latency cells. As indicated in
Section 9, other distributions could be conjectured that would yield
equally good correlation.

• The generalized Urn Model, defined in Section 9, was not evaluated in the
study. However, because faults were identified by partition, it is possible
to obtain a rough estimate of g(a) by assuming a constant probability
of detection in each partition and estimating the corresponding
Urn Model parameter, a, suitably weighted by the failure rate of the
partition.
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% On detected IN 1ST REPETITION 

FIGURE 11 DETECTED COMBINED FAULTS VS. NO. OF EXECUTED INSTRUCTIONS
(GATE-LEVEL FAULTS)
[Plot: percent detected/undetected versus number of executed instructions.
O = detected in 1st repetition; Δ = undetected after 8 repetitions.]

FIGURE 12 DETECTED COMBINED FAULTS VS. NO. OF EXECUTED INSTRUCTIONS
(COMPONENT-LEVEL FAULTS)
[Plot: percent detected/undetected versus number of executed instructions.
O = detected in 1st repetition; Δ = undetected after 8 repetitions.]

Table 19 INSTRUCTION MIX versus DETECTION in PHASE I EXPERIMENTS

                                  TYPE OF INSTRUCTION                     GATE-LEVEL                COMPONENT-LEVEL
          TOTAL # OF    LOAD                                        PERCENT                   PERCENT
          EXECUTED      AND     ADD AND                             DETECTED 1st  PERCENT     DETECTED 1st  PERCENT
PROGRAM   INSTRUCTIONS  STORE   SUBTRACT  BRANCH  TRANSFER  CLEAR   REPETITION    UNDETECTED  REPETITION    UNDETECTED
FETSTO      6            3       1         2       0        0       29.9          61.7        51.3          35.5
ADDSUB     11            4       3         2       2        0       33.5          59.5        57.0          32.3
FIB        11            3       3         4       0        1       35.0          58.2        61.3          28.0
QUAD       87           12      31        38       6        0       43.2          53.3        71.8          23.5
SERCOM     59           12      18        24       5        0       39.5          60.5        64.8          35.3
LINCON    147           76      20        39      11        1       51.7          48.3        76.5          23.5

Note: This table is based upon one pass through the main program.






7.0 ANALYSIS OF UNDETECTED FAULTS 

7.1 Phase I Experiments 


Because of the large numbers of undetected faults in the Phase I experi- 
ments it was not practicable to determine why each undetected fault was not 
detected. However, from the breakdown of faults by partitions, from the nature 
of the Phase I programs and from the analysis of undetected faults in the 
Phase II experiments it is possible to assess, in general terms, why faults were 
not detected in the Phase I experiments. 

7.1.1 Undetected Faults in Phase I 

• From the Phase I latency distributions of Section 5 it can be seen that
most undetected faults occurred in the micromemory (i.e., Partition #5).
This was the result of the limited instruction sets used by the Phase I
programs and the large proportion of indistinguishable faults contained
in the micromemory. As an example, the BDX-930 contains 79 types of
macroinstructions but no Phase I program used more than 10. Moreover, the
Self-Test program, which used almost the entire instruction set, allowed
41 undetected faults of the micromemory of which 33 were indistinguishable.
Thus, about 80% of all micromemory faults are indistinguishable.

• The 2901 chips were another prime source of undetected faults, as can be
seen from the latency distribution for Partition #4. Excluding indis- 
tinguishable faults, these faults are associated with the 2901 RAM which 
contains the 16 arithmetic registers. The Phase I programs only exer- 
cised 2 or 3 of these registers so it is not surprising that faults in 
the unused registers were undetected. 

7.2 Phase II Experiments

Every undetected fault in the Phase II gate-level experiment was analyzed 
to determine why it was not detected. This turned out to be an exceedingly 
difficult and tedious task that required an intimate knowledge of the computer 
and its operation. As an indication of the magnitude of the task, out of the 
300 faults injected, 102 were undetected. Out of the 102 undetected faults 71
were identified as indistinguishable. This, however, did not lessen the task
since indistinguishable faults were themselves difficult to identify.
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7.2.1 Undetected Faults in Phase II 


• The gate-level undetected faults were distributed over the partitions as
follows:

PARTITION   DISTINGUISHABLE   INDISTINGUISHABLE   TOTAL
    1              4                  2              6
    2              7                  9             16
    3              0                 16             16
    4              9                  6             15
    5              8                 33             41
    6              3                  5              8
  TOTAL           31                 71            102

• The micromemory contained the largest number of indistinguishable faults.
This was due to the following:

  - 3 bits of every 56-bit microword were unused.

  - 100 out of 512 microwords were spares.

  - Approximately 10 of 56 bits in a microword are used in executing an
    instruction; the remaining bits, whether faulted or not, are
    effectively ignored.

Indistinguishable faults in the micromemory were especially difficult
to identify. The problem is that an unused but faulted bit violates
the ground rules of the hardware design and the resultant effects are,
therefore, unanticipated. To ascertain the effects of such a bit it is
necessary to consider all possible scenarios in which the faulted
microword is used and track the effects of the faulted bit in each
instance. In the present study only the most obvious scenarios were
analysed. Even so, it required an engineer with considerable expertise
in the hardware design to perform the task.

• In Partition #1 the 4 distinguishable undetected faults were associated
with the most significant bits of the memory address register. These
were faults that would have prevented access to memory locations not
accessed by the Self-Test program.

• In Partition #2 the 7 distinguishable, undetected faults affected the
program counter (3), the multiplexer that selected either the program
counter or the temporary register (2), and the generation of I/O
strobes (2).
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As with the memory address register, the faults affecting the program 
counter would have prevented the counter from addressing memory not accessed 
by the Self-Test program. The multiplexer faults caused the temporary register 
to be selected instead of the program counter. The contents of these registers 
are identical during the Self-Test program. 

• In Partition #3 there were no undetected, distinguishable faults. How- 
ever, all faults that affected the power-on sequence and were not de- 
tected, were designated as indistinguishable. Other indistinguishable 
faults affected the unused carry bits of registers. 

• In Partition #4 the 9 distinguishable, undetected faults affected the 
upper arithmetic registers (7) and the register carry bits (2). The 
Self-Test program did not sufficiently exercise the bit patterns in the 
upper arithmetic registers. 

• In Partition #5 the 8 distinguishable, undetected faults were the result
of not sufficiently exercising the microinstruction set. Because of
the very large number of variations a thorough test of the micromemory is
extremely difficult to achieve. In the future, processors should
incorporate a micromemory sum check for this purpose.

• In Partition #6 the 3 distinguishable, undetected faults affected the 
"jump relative" instructions which were not exercised by the Self-Test 
program. Again, it is extremely difficult to exercise these instruc- 
tions in every possible variation. 

7.3 Gate-Level versus Component-Level Faults 

A detailed analysis of the undetected failures indicated that component- 
level faults (i.e., faults at the device pins) are relatively easy to detect. 

In fact, with a little extra care in the Self-Test design there would have been 
no undetected component-level faults. It should be noted, in this regard, that 
no effort was made to modify the Self-Test on the basis of trial runs; the 
initial Self-Test program remained unchanged throughout the study. 

The ease of fault detection at the component-level is not surprising when 
one considers that a single pin is used in a variety of operations and conse- 
quently, will affect many diverse operations when faulted. Two examples are 
(1) the output bits of a microword and (2) the 2901 arithmetic register data 
outputs. A fault of a microword pin-out will affect that bit position in every 
microword. Such a fault will surely be detected whenever a microinstruction 
uses the complementary value of the faulted bit. In the case of the arithmetic 
registers, it is not possible to fail a bit of one of the 16 registers without, 
at the same time, failing this bit in all arithmetic registers. Thus, if the 
faulted bit is suitably exercised by at least one arithmetic register the fault 
will be detected. 
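
The difference between a pin-level fault and a fault confined to one register can be sketched as follows. The model is a deliberate simplification with hypothetical names: a fault on a shared data output bit corrupts the read of every register, whereas a fault in a single register cell is seen only when that register is read.

```python
NUM_REGS = 16  # the 2901 RAM holds 16 arithmetic registers


def force_bit(value: int, bit: int, stuck_value: int) -> int:
    """Force one bit of value to the stuck value (0 or 1)."""
    mask = 1 << bit
    return (value | mask) if stuck_value else (value & ~mask)


def read(regs, index, pin_fault=None, cell_fault=None):
    """Read a register through an optionally faulted path.

    pin_fault  = (bit, stuck_value) on the shared data output.
    cell_fault = (register, bit, stuck_value) inside one register.
    """
    value = regs[index]
    if cell_fault is not None and cell_fault[0] == index:
        value = force_bit(value, cell_fault[1], cell_fault[2])
    if pin_fault is not None:
        value = force_bit(value, pin_fault[0], pin_fault[1])
    return value


def detected(used_regs, pattern, **fault) -> bool:
    """True if any register the test reads returns a wrong value."""
    regs = [pattern] * NUM_REGS
    return any(read(regs, r, **fault) != pattern for r in used_regs)


if __name__ == "__main__":
    few = [0, 1, 2]  # a test that touches only three registers
    print(detected(few, 0x0000, pin_fault=(3, 1)))
    print(detected(few, 0x0000, cell_fault=(9, 3, 1)))
```

A test that reads only three registers still catches the pin-level fault, but misses the fault buried in register 9 unless that register is also exercised.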

It may be conjectured that, by injecting faults into the approximately 1200
pins of the BDX-930, an efficient self-test could be designed to achieve 100% 
detection of component-level faults. 
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The major obstacle to detection of gate-level faults is their data-depen- 
dence. A faulted gate node may not manifest itself at a pin-out unless it is 
exercised by an appropriate combination of input and internal state. In
particular, the fault may not show up during a test because the test did not
create the correct conditions. We note, also, that such faults are exceedingly
difficult to analyze.
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8.0 UNIPROCESSOR BIT 


In the present study faults were limited to the central processor unit of 
the avionics processor. In practice, however, fault analysis must include the 
entire processor. In this section methods of detecting faults which occur 
elsewhere in the processor are examined. 

An avionics processor customarily contains a built-in-test (Bit) proce- 
dure for fault detection and identification of faulted components. A typical 
Bit procedure consists of applying stimuli to the processor circuitry and 
determining whether the resultant responses at designated test points are 
correct. A Bit procedure which utilizes only the resources of a single pro- 
cessor is referred to as a "Uniprocessor Bit". 

There are typically three types of uniprocessor Bit: Bit to detect
faults 1) prior to take-off, 2) for maintenance purposes and 3) inflight.

The differences are mainly in the coverage and isolation requirements, time 
available to complete the test and the inclusion or exclusion of other sub- 
systems such as sensors and actuators. In the present discussion we will 
consider only inflight Bit although the methodology is general and applies 
more or less verbatim to all types of Bit. 

Control System Scenario 

In this scenario the purpose of inflight Bit is to eliminate a channel 
of redundancy. The configuration is triplex consisting of three identical 
processors with dedicated sensors. 

The system is designed to maximize survivability subject to the constraint 
of three channels. This is achieved by taking full advantage of the inherent 
self-detection capability of each processor. When a processor detects a fail- 
ure of itself it will either disengage itself from the affected axis but other- 
wise continue all other control functions and computations or, if it is a com- 
putational failure which requires data from the other channels for detection, 
will select the correct data and use it in all other control computations. 

This strategy localizes the effects of a failure and allows the processor to 
perform those remaining control functions and computations that are unaffected 
by the failure. If the processor cannot detect its own failure and take 
correct action then the other processors will cause the errant processor to 
be disengaged from all control axes via dedicated failure logic. 

If maximum survivability is to be obtained it is essential not only to 
detect but isolate a second failure to the failed processor. The self-detection 
capability of each processor ensures that a significant proportion of second
failures may be detected and isolated to the offending processor without the
need for comparison-monitoring. For those faults that require comparison-
monitoring for their detection, isolation is achieved by executing an on-line
self-test program in each of the contending processors. This application of
self-test is exactly the same as described in Section 4.6.

Survivability Benefits of Inflight Bit 

Inflight Bit is comprised of tests which are 1) conducted as part of 
normal inflight redundancy management, 2) callable by the software program or 
3) initiated by processor interrupt. Let us assume that the first failure 
resulted in complete loss of the affected processor. Actually, this is a con- 
servative assumption since the fault might have been isolated to the affected 
axis by the faulty processor. If a second fault occurs in one of the
remaining processors and cannot be isolated by the errant processor, either
by normal redundancy management procedures or by a callable subroutine, then
the non-faulted processor, and perhaps even the faulted processor, calls for
the initiation of an interrupt. In this event both processors are interrupted
(subsequent interrupts are thereafter inhibited) to execute an on-line self-test
which includes the CPU and portions of critical memory. During execution of
this self-test all control law computations are suspended. If the faulty 
processor detects its own fault then it is disengaged from the system and the 
non-faulted processor assumes complete control. Inability of the faulty pro- 
cessor to detect its own fault is assumed to result in loss of control. 


If

     A = failure rate of a processor
     T = duration of a flight
     1 - a = second failure coverage

then

(1)  3 (AT)^2 a = probability of loss of control with inflight Bit,

(2)  3 (AT)^2 = probability of loss of control without inflight Bit.

Comparing (1) and (2), it can be seen that inflight Bit improves survivability
by the factor 1/a. In many control applications this is sufficient to justify
the elimination of a redundant processor.
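
Expressions (1) and (2) can be evaluated numerically as a check on the 1/a improvement factor. The sketch below uses hypothetical function names and purely illustrative values for the failure rate, flight duration and coverage; none of these numbers comes from the study.

```python
def p_loss_without_bit(failure_rate: float, flight_time: float) -> float:
    """Expression (2): 3 (AT)^2."""
    return 3.0 * (failure_rate * flight_time) ** 2


def p_loss_with_bit(failure_rate: float, flight_time: float, a: float) -> float:
    """Expression (1): 3 (AT)^2 a, where 1 - a is second failure coverage."""
    return p_loss_without_bit(failure_rate, flight_time) * a


if __name__ == "__main__":
    A = 1.0e-4       # illustrative failure rate per hour (assumed)
    T = 1.0          # illustrative flight duration in hours (assumed)
    coverage = 0.95  # illustrative second failure coverage (assumed)
    a = 1.0 - coverage
    without = p_loss_without_bit(A, T)
    with_bit = p_loss_with_bit(A, T, a)
    print(f"without Bit: {without:.3e}")
    print(f"with Bit:    {with_bit:.3e}")
    print(f"improvement: {without / with_bit:.1f}x")
```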

Scope of Inflight Bit 

An examination of the inflight Bit scenario indicates that the procedures 
for isolating the second failure rely entirely on the resources of a single 
processor. The differences between inflight Bit and conventional preflight
Bit are:
1) Inflight Bit is initiated by a miscomparison whereas conventional Bit
is initiated by an external command.

2) Inflight Bit is severely constrained by time: it must detect the
fault before it adversely affects the aircraft and in such a way
that the processor's ability to control the aircraft is not
significantly diminished. As a consequence, inflight Bit does not exercise
sensors or actuators or other aircraft subsystems.

As an indication of the scope of inflight Bit the critical components of
a typical avionics processor are identified in Table 20 along with their
corresponding failure rates*. It is noted that only loss of critical components
affects survivability. Thus, the failure rates in (1) and (2) refer to critical
components only. In the target processor of Table 20 critical components
comprise about 82% of the total processor.

Referring to Table 20 it can be seen that the CPU comprises 10.76% of the
critical components. If

     1 - a_1 = coverage of the CPU

and  1 - a_2 = coverage of all other critical components

then the total coverage of critical components is

(3)  1 - a = 0.1076 (1 - a_1) + 0.8924 (1 - a_2).
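
Expression (3) is a weighted average and is easy to evaluate. In the sketch below the CPU coverage is taken as the 86.5% gate-level Self-Test figure reported in Section 6.2, while the coverage of the other critical components is a purely hypothetical value chosen for illustration; the function name is also an assumption.

```python
CPU_SHARE = 0.1076  # CPU fraction of the critical components (Table 20)


def total_coverage(cpu_coverage: float, other_coverage: float) -> float:
    """Expression (3): coverage weighted by share of critical components."""
    return CPU_SHARE * cpu_coverage + (1.0 - CPU_SHARE) * other_coverage


if __name__ == "__main__":
    cpu = 0.865    # gate-level Self-Test coverage from Section 6.2
    other = 0.95   # hypothetical coverage of the remaining components
    print(f"total coverage 1 - a = {total_coverage(cpu, other):.4f}")
```

Because the CPU carries only about a tenth of the weight, the total is dominated by the coverage achieved on the other critical components.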

Fault Detection Procedures of Inflight Bit

A detailed discussion of the fault detection procedures of inflight Bit 
is beyond the scope of this study. The treatment will be limited to a general 
survey. Table 21 indicates the principal tests used to detect faults in selected 
components. These tests include: 

1) CPU Self-Test 

The reader is already familiar with this test. 

2) Watchdog Timer 

The watchdog timer is a frequency-sensitive circuit which "times out"
unless it is updated by a toggle bit at a fixed frequency. The toggle
bit alternates between logic 0 and logic 1 and is supplied by the
software program. One of its uses is to detect variations of the
real-time clock which exceed a specified threshold. The principal
use, however, is to detect a jump out of the program caused either
by a hardware fault or a software error, either of which prevents
update of the toggle.

* From MIL-HDBK-217B, Notice 2.

3) Parity

The program and scratchpad memories contain an extra bit in each word 
for parity. The parity is checked, by hardware, after every memory 
read. A failure of parity results either in an interrupt or the 
setting of a flag. 

4) Memory Sum 

The "read-only" memories are subdivided into 1K blocks and the sum
of each block is precomputed and stored. These sums can be checked
periodically or as required.

5) RAM Addressing

These procedures test column addressing, row addressing and block
addressing for each 1K block of RAM. The tests completely check the
row and column decoders within each block of RAM.

6) Wrap-Arounds (Analog Signals) 

The analog outputs at the sample and holds and the valve drive 
amplifiers are fed-back as analog inputs and checked against the 
digital commands. 

7) Bias Inputs to Multiplexers 

Each analog input multiplexer contains a bias input which is settable
by software. The bias inputs are located at selected pins in such
a way that a faulty address will input a signal other than a bias
on at least one multiplexer.

8) Reconfiguration 

When a critical sensor miscompares and it cannot be isolated to a 
portion of the I/O circuitry both processors call for a reconfigured 
control law which does not use the affected sensor. 
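The Memory Sum test described above can be sketched as follows. This is an illustration under stated assumptions, not the flight software: the 16-bit word width, the modulo-2**16 sum, the function names and the memory contents are all hypothetical.

```python
BLOCK_WORDS = 1024   # 1K-word blocks, as in the text
WORD_MASK = 0xFFFF   # assumption: 16-bit words, sums kept modulo 2**16


def block_sums(memory):
    """Return the modulo-2**16 sum of each 1K block of a list of words."""
    return [
        sum(memory[start:start + BLOCK_WORDS]) & WORD_MASK
        for start in range(0, len(memory), BLOCK_WORDS)
    ]


def failed_blocks(memory, stored_sums):
    """Indices of blocks whose recomputed sum differs from the stored sum."""
    current = block_sums(memory)
    return [i for i, (now, ref) in enumerate(zip(current, stored_sums)) if now != ref]


# Made-up memory image of four blocks; the reference sums play the role of
# the "precomputed and stored" values.
rom = [(7 * i + 3) & WORD_MASK for i in range(4 * BLOCK_WORDS)]
reference = block_sums(rom)

corrupted = list(rom)
corrupted[2 * BLOCK_WORDS + 5] ^= 0x0010   # inject a single-bit fault into block 2

print(failed_blocks(rom, reference), failed_blocks(corrupted, reference))
```

A plain sum detects any single-bit fault within a block, but compensating multi-bit errors can cancel, which is one reason the memories also carry parity.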

Inflight Bit Design Methodology and Validation 

The methodology of the Bit design is as follows: 

1) Identify potentially critical components based upon 

• anticipated function of the device 

• projected failure modes and effects 

If the criticality of a device is doubtful, assume it is critical. 
2) Identify the failure modes of each critical device at the device level.

3) Identify the effects of each failure mode and assess its criticality. 

4) Associate a probability of occurrence with each critical failure 
mode. Disqualify non-critical faults and reduce the failure rate of 
the device accordingly. 

5) For each critical failure mode establish a failure detection procedure 
based upon the anticipated failure effects. This procedure may con- 
sist of a combination of software and additional hardware (e.g., the 
addition of wrap-arounds). 

6) Estimate the level of coverage for each critical device and for the 
total critical system. 

In following this procedure the designer should emphasize components with 
relatively high failure rates; otherwise a disproportionate effort could be 
placed on detecting faults with small probabilities of occurrence. 

The validation of Bit coverage consists, essentially, of an independent
assessment of steps (1) through (6). This assessment is made difficult by
several factors:

• The number of possible responses of a digital circuit is large.

• The detection of most faults is dependent on the operating system
software; the analyst must be familiar with the failure detection
procedures which are implemented by that software.

These difficulties were overcome, at least for the cpu, by emulation
using the actual self-test software. The difficulties remain, however, for
the rest of the system, which constitutes approximately 90% of the total
critical hardware.
TABLE 20

FAILURE RATES OF CRITICAL COMPONENTS

COMPONENT                                    FAILURE RATE (x10^-6)/HR.

CPU                                             42.94
Real Time Clock                                  2.81
Interrupt Logic                                  4.54
Program Memory                                   9.71
Scratchpad Memory                               15.62
Memory Mapped Discretes                          1.44
Memory Parity                                    7.82
I/O Controller (Sequencer & File Memory)        11.052
Intercomputer Data Links                        29.592
AC/DC Inputs                                    20.697
Input Multiplexers & Ampl.                       8.58
AD Converter                                     1.92
Input Discretes                                  9.11
DA Converter                                     5.08
Valve Drive Amplifiers                           8.04
Sample & Hold Circuits                           3.59
Output Discretes                                15.53
Failure Logic                                   19.63
Power Supply                                    21.86
Misc.                                            0.22
CPU PC Board/Connector                          34.4
Memory PC Board/Connector                       33.15
Analog I/O PC Board/Connector                   23.2
I/O Controller PC Board/Connector               19.81
Servo Ampl. PC Board/Connector                  26.53
Harness Assy                                    22.15

Total Critical                                 399.021
Total Components                               487.19

Critical/Total                                   0.82
TABLE 21

INFLIGHT BIT TEST PROCEDURES

     COMPONENT                       METHOD OF FAILURE DETECTION

 1)  CPU                             CPU Self-Test
 2)  Real Time Clock                 Watchdog Timer
 3)  Program Memory                  Parity & Memory Sum Test
 4)  Scratchpad Memory               Parity
                                     Redundant Memory
                                     Addressing and Bit Pattern Tests
 5)  DA, AD Converters               Wrap-arounds
     Sample and Hold Circuits
     Valve Drive Amplifiers
 6)  Input Multiplexers              Bias Inputs, Patterns of Faults
 7)  Discrete Inputs, Outputs        Wrap-arounds
                                     Patterns of Faults
 8a) I/O Controller Sequencer        Patterns of Faults
 8b) I/O Controller File Memory      Parity Tests
                                     Memory Sum Test
 9)  Intercomputer Data Links        Not detectable
10)  AC, DC Inputs                   Reasonableness and Reconfiguration
11)  PC Boards and Connectors        Detection dependent on affected devices
12)  Power Supply                    Level Detectors
9.0 URN MODEL


9.1 Urn Model Description 

Several models have been investigated in an attempt to characterize the dy- 
namics of fault propagation in a digital computer. Although simplistic in their 
assumptions, these models may, nevertheless, provide insight into this undoubt- 
edly complex process. It has been conjectured (ref. 1) that the distribution of 
latency can be modelled by analogy with balls in an urn. We prefer to employ a 
different analogy although the resultant distributions are the same. 

We postulate that the computer can be subdivided into three sets of mutually
exclusive components $C_1$, $C_2$, $C_3$ such that

$C_1$ = Set of components randomly exercised by the program
$C_2$ = Set of components continually exercised by the program
$C_3$ = Set of components never exercised by the program.

We make the further assumption that a fault is detected if and only if the
faulted component is exercised. The scenario is that of an avionics computer
executing two software programs, one of which is executed full-time and the
other part-time. The components that are exercised by the full-time mode are
denoted by $C_2$ and those exercised by the part-time mode by $C_1$. Neither the
full-time nor the part-time mode exercises the components $C_3$.

We assume that the part-time mode is exercised randomly. If the unit of
time is a repetition of the full-time program then we postulate that the
excitation is Poisson-distributed in time with a = probability that the part-time
mode is exercised in a repetition of the full-time program.

Let $\lambda_1$ = Failure rate of $C_1$ (Failures/hour)

$\lambda_2$ = Failure rate of $C_2$ (Failures/hour)

$\lambda_3$ = Failure rate of $C_3$ (Failures/hour)

$\lambda = \lambda_1 + \lambda_2 + \lambda_3$ (Failures/hour)

We now derive the latency distribution given that a fault has just occurred.
The distribution is defined in terms of three parameters, $a$, $P$ and $Q_0$ where

$P$ = Probability that the fault is detected in the first repetition
given that it occurred in sets $C_1$ or $C_2$

$Q_0$ = Probability that the fault is never detected.
It is easy to derive the following relationships:

1) $P_0 = 1 - Q_0 = \frac{\lambda_1}{\lambda} + \frac{\lambda_2}{\lambda}, \qquad Q_0 = \frac{\lambda_3}{\lambda}$

2) $P = \frac{1}{P_0}\left(\frac{\lambda_2}{\lambda} + a\,\frac{\lambda_1}{\lambda}\right)$

If

$p_k$ = probability that the fault is detected in the k-th repetition and
not detected in a previous repetition, k = 1, 2, 3, ..., n,

$q_{n+1}$ = probability that the fault is not detected in the previous n
repetitions,

then

$p_1 = P_0 P = \frac{\lambda_2}{\lambda} + a\,\frac{\lambda_1}{\lambda}$

$p_2 = (1 - P)\,a\,P_0 = a\,(1 - a)\,\frac{\lambda_1}{\lambda}$

3) $p_n = (1 - P)\,(1 - a)^{n-2}\,a\,P_0 = a\,(1 - a)^{n-1}\,\frac{\lambda_1}{\lambda}$, n = 2,3,...

$q_{n+1} = 1 - \sum_{k=1}^{n} p_k = Q_0 + (1 - P)\,P_0\,(1 - a)^{n-1} = \frac{\lambda_3}{\lambda} + (1 - a)^{n}\,\frac{\lambda_1}{\lambda}$, n = 1,2,3,...

Observe that

$q_{n+1} + \sum_{k=1}^{n} p_k = 1$, as expected.

In estimating the above distribution the number of repetitions will be
limited to eight. Then, the study will estimate the quantities

$p_1, p_2, \ldots, p_8, q_9$

for S-a-1, S-a-0 and combined faults.
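The Urn Model distribution can be evaluated directly from equations (3). The sketch below is illustrative only; the function name and the numerical failure rates are hypothetical and are not taken from the study.

```python
def urn_latency(l1, l2, l3, a, n=8):
    """Return ([p_1, ..., p_n], q_{n+1}) for the Urn Model of Section 9.1.

    l1, l2, l3 are the failure rates of C1, C2, C3 and a is the probability
    that the part-time mode is exercised in one repetition.
    """
    lam = l1 + l2 + l3
    p = [l2 / lam + a * l1 / lam]                                      # p_1
    p += [a * (1 - a) ** (k - 1) * l1 / lam for k in range(2, n + 1)]  # p_k, k >= 2
    q = l3 / lam + (1 - a) ** n * l1 / lam                             # q_{n+1}
    return p, q


# Hypothetical rates (any consistent unit) and exercise probability.
p, q9 = urn_latency(l1=2.0, l2=3.0, l3=5.0, a=0.25)
print([round(v, 4) for v in p], round(q9, 4))
```

The eight detection probabilities and the residual $q_9$ sum to one, which is the check stated after equations (3).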
9.2 Generalized Urn Model


One of the deficiencies of the Urn Model is that it assumes that each fault
in set $C_1$ is exercised with the same probability, a. If this is not the case
then the distribution cannot be represented by the Urn Model. This can be
demonstrated as follows.

Subdivide $C_1$ into mutually exclusive sets $C_{11}$, $C_{12}$ with failure rates
$\lambda_{11}$, $\lambda_{12}$, respectively, and such that

$a_i$ = probability that the components of $C_{1i}$ are exercised in a repetition
of the full-time program, for i = 1,2.

Naturally $\lambda_{11} + \lambda_{12} = \lambda_1$.

In this case we easily derive

$p_1 = \frac{\lambda_2}{\lambda} + a_1\,\frac{\lambda_{11}}{\lambda} + a_2\,\frac{\lambda_{12}}{\lambda}$

$p_2 = a_1 (1 - a_1)\,\frac{\lambda_{11}}{\lambda} + a_2 (1 - a_2)\,\frac{\lambda_{12}}{\lambda}$

4) $p_3 = a_1 (1 - a_1)^2\,\frac{\lambda_{11}}{\lambda} + a_2 (1 - a_2)^2\,\frac{\lambda_{12}}{\lambda}$

$p_k = a_1 (1 - a_1)^{k-1}\,\frac{\lambda_{11}}{\lambda} + a_2 (1 - a_2)^{k-1}\,\frac{\lambda_{12}}{\lambda}$, k = 2,3,4,...

But, according to the Urn Model, the distribution should be representable in
the form

$p_k = a\,(1 - a)^{k-1}\,\frac{\lambda_1}{\lambda}$, k = 2,3,...

However, there does not exist a value of "a" such that

$a\,(1 - a)^{k-1}\,\frac{\lambda_1}{\lambda} = a_1 (1 - a_1)^{k-1}\,\frac{\lambda_{11}}{\lambda} + a_2 (1 - a_2)^{k-1}\,\frac{\lambda_{12}}{\lambda}$

for all values of k = 2,3,4,..., unless $a_1 = 0$ or $a_2 = 0$ or $a_1 = a_2$.
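A quick numerical check makes the argument concrete. In the single-parameter Urn Model the ratio $p_{k+1}/p_k$ equals $(1 - a)$ for every k >= 2, so a two-subset distribution can match it only if its own ratio is constant. The sketch below uses hypothetical parameter values (assumptions, not data from the study) to show that the ratio drifts when $a_1 \ne a_2$ and is constant when $a_1 = a_2$.

```python
def mixture_pk(k, a1, a2, l11, l12, lam):
    """p_k (k >= 2) from equations (4) for the two-subset model."""
    return (a1 * (1 - a1) ** (k - 1) * l11 + a2 * (1 - a2) ** (k - 1) * l12) / lam


# Hypothetical parameters: two subsets with very different exercise probabilities.
a1, a2, l11, l12, lam = 0.8, 0.1, 1.0, 1.0, 4.0

ratios = [
    mixture_pk(k + 1, a1, a2, l11, l12, lam) / mixture_pk(k, a1, a2, l11, l12, lam)
    for k in range(2, 8)
]

# With a1 = a2 = 0.3 the model collapses to the Urn Model and the ratio is 1 - a.
equal_case = [
    mixture_pk(k + 1, 0.3, 0.3, l11, l12, lam) / mixture_pk(k, 0.3, 0.3, l11, l12, lam)
    for k in range(2, 8)
]

print([round(r, 3) for r in ratios])
print([round(r, 3) for r in equal_case])
```

The drifting ratio reflects the fact that the tail of the mixture is increasingly dominated by the subset with the smaller exercise probability.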
In a real processor it must be presumed that faults in set $C_1$ cannot be
characterized by a single probability of detection. It is more likely that the 
fault set produces a range of values from zero to one. If this is the case then 
it would appear that the Urn Model is severely restricted in its applicability. 
This is not to say that the model cannot provide a reasonably good description 
of fault latency in a real processor. In fact, the results of Section 5 show 
surprisingly good correlation between the model and the empirical distributions. 
This correlation is more than a coincidence, as we will now demonstrate by a 
comparison with a more elaborate and, hopefully, more realistic model. 

9.2.1 An Alternate Model 

The following derivation is informal and heuristic. No attempt will be 
made to evaluate the resultant model in the present study. Our intention is to 
exhibit the characteristics of a more realistic model for comparison with the 
Urn Model and as a baseline for future studies. 

We associate with each fault in set $C_1$ a probability of detection, "a",
where "a" can have any value on the interval $0 \le a \le 1$. We postulate the
existence of a function, g(a), such that

$\int_{\alpha}^{\beta} g(a)\,da = \gamma$

where

$\gamma$ = probability of occurrence of all faults which yield a value of "a"
on the interval $\alpha < a < \beta$.

We observe that

$\int_0^1 g(a)\,da = \lambda_1$.

In the interests of simplicity we have idealized the processor in that we
assume a continuum of faults and an integrable function, g(a). In a real
processor the number of faults is finite and g(a) is actually a discrete function.
Now, let

$0 = a_0 < a_1 < a_2 < \ldots < a_N = 1$

be a partition of the interval $0 \le a \le 1$
and define

$da_i = a_i - a_{i-1}$

$d\lambda_i = g(a_i)\,da_i$.

We note that

$d\lambda_i$ = probability of occurrence of all faults which yield a value of "a"
on the interval $a_i < a < a_i + da_i$.

If $da_i$ is sufficiently small we can assume that the faults corresponding to
the interval $(a_i, a_i + da_i)$ can be represented by a single probability of
occurrence, $a_i$. As a consequence, the Urn Model describes the latency distribution
of faults on these intervals. Thus, the latency distribution over the entire
interval is

$p_1 = \frac{\lambda_2}{\lambda} + \frac{1}{\lambda} \sum_i a_i\,d\lambda_i$

$p_2 = \frac{1}{\lambda} \sum_i a_i (1 - a_i)\,d\lambda_i$

5) $p_n = \frac{1}{\lambda} \sum_i a_i (1 - a_i)^{n-1}\,d\lambda_i$

$q_{n+1} = \frac{\lambda_3}{\lambda} + \sum_{k=n+1}^{\infty} p_k$.
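If we replace $d\lambda_i$ by $g(a_i)\,da_i$ and pass to the limit we obtain

$p_1 = \frac{\lambda_2}{\lambda} + \frac{1}{\lambda} \int_0^1 a\,g(a)\,da$

$p_2 = \frac{1}{\lambda} \int_0^1 a\,(1 - a)\,g(a)\,da$

6) $p_n = \frac{1}{\lambda} \int_0^1 a\,(1 - a)^{n-1}\,g(a)\,da$, n = 2,3,4,...

$q_{n+1} = \frac{\lambda_3}{\lambda} + \sum_{k=n+1}^{\infty} p_k$.

Again, we note that

$q_{n+1} + \sum_{k=1}^{n} p_k = \frac{\lambda_2}{\lambda} + \frac{\lambda_3}{\lambda} + \frac{1}{\lambda} \int_0^1 g(a)\,da = 1$.

Let us represent (6) using the quantities $P$ and $P_0$.
As before

7) $P_0 = \frac{\lambda_1}{\lambda} + \frac{\lambda_2}{\lambda}$.

If we define $\bar{a}$ as the average value of "a" defined by

$\bar{a} = \frac{1}{\lambda_1} \int_0^1 a\,g(a)\,da$

then

8) $P\,P_0 = \frac{\lambda_2}{\lambda} + \bar{a}\,\frac{\lambda_1}{\lambda}$.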
Note the similarity with (2). Solving (7) and (8) for $\lambda_1/\lambda$ gives

$\frac{\lambda_1}{\lambda} = \frac{(1 - P)\,P_0}{1 - \bar{a}}$.

Substituting this into (6) gives

$p_1 = P\,P_0$

9) $p_2 = \frac{(1 - P)\,P_0}{1 - \bar{a}} \cdot \frac{1}{\lambda_1} \int_0^1 a\,(1 - a)\,g(a)\,da$

$p_n = \frac{(1 - P)\,P_0}{1 - \bar{a}} \cdot \frac{1}{\lambda_1} \int_0^1 a\,(1 - a)^{n-1}\,g(a)\,da$.

In the Urn Model we had

$\frac{1}{\lambda_1} \int_0^1 a\,(1 - a)^{n-1}\,g(a)\,da = \bar{a}\,(1 - \bar{a})^{n-1}$

since g(a) was a delta function at $a = \bar{a}$. Substituting this into (9) we
obtain (3).

Equations (6) along with g(a) define the latency distribution. Regarding
this distribution we observe that it is monotonic non-increasing, i.e.,

$p_n \ge p_{n+1}$, n = 2,3,4,...

This follows because

$a\,(1 - a)^{n-1}\,g(a) \ge a\,(1 - a)^{n}\,g(a)$.
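9.2.2 Examples

We illustrate the model with several examples.

Example #1

If g(a) is a delta function of magnitude $\lambda_1$, i.e.,

$g(a) = \lambda_1\,\delta(a - \bar{a})$

then equations (6) reduce to those of the Urn Model.
Thus,

$p_1 = \frac{\lambda_2}{\lambda} + \bar{a}\,\frac{\lambda_1}{\lambda}$

10) $p_n = \bar{a}\,(1 - \bar{a})^{n-1}\,\frac{\lambda_1}{\lambda}$, n = 2,3,4,...


Example #2

Let $g(a) = \lambda_1$ for $0 \le a \le 1$.
Then, from (6),

$p_1 = \frac{\lambda_2}{\lambda} + \frac{1}{2}\,\frac{\lambda_1}{\lambda}$

11) $p_n = \frac{1}{n\,(n + 1)}\,\frac{\lambda_1}{\lambda}$, n = 2,3,4,...

$q_{n+1} = \frac{\lambda_3}{\lambda} + \frac{1}{n + 1}\,\frac{\lambda_1}{\lambda}$, n = 1,2,3,...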
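Example #3

Let $g(a) = 2\,\lambda_1\,a$ for $0 \le a \le 1$.
Then, from (6),

$p_1 = \frac{\lambda_2}{\lambda} + \frac{2}{3}\,\frac{\lambda_1}{\lambda}$

12) $p_n = \frac{4}{n\,(n + 1)\,(n + 2)}\,\frac{\lambda_1}{\lambda}$, n = 2,3,4,...

$q_{n+1} = \frac{\lambda_3}{\lambda} + \frac{2}{(n + 1)\,(n + 2)}\,\frac{\lambda_1}{\lambda}$, n = 1,2,3,...


Example #4

Let $g(a) = 2\,\lambda_1\,(1 - a)$, $0 \le a \le 1$.
Then, from (6),

$p_1 = \frac{\lambda_2}{\lambda} + \frac{1}{3}\,\frac{\lambda_1}{\lambda}$

13) $p_n = \frac{2}{(n + 1)\,(n + 2)}\,\frac{\lambda_1}{\lambda}$, n = 2,3,4,...

$q_{n+1} = \frac{\lambda_3}{\lambda} + \frac{2}{n + 2}\,\frac{\lambda_1}{\lambda}$, n = 1,2,3,...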
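The closed forms of Examples #2 through #4 can be checked by integrating equations (6) numerically. The sketch below is a verification aid under stated assumptions: the midpoint-rule integrator, the function names and the failure-rate fractions are hypothetical, and $q_{n+1}$ is evaluated through the identity $\sum_{k>n} a\,(1-a)^{k-1} = (1-a)^n$, which turns the tail sum into a single integral.

```python
def integrate(f, n_steps=5000):
    """Midpoint-rule integral of f over [0, 1]."""
    h = 1.0 / n_steps
    return h * sum(f((i + 0.5) * h) for i in range(n_steps))


def p_n(g, n, lam):
    """p_n for n >= 2, from equations (6)."""
    return integrate(lambda a: a * (1 - a) ** (n - 1) * g(a)) / lam


def q_next(g, n, l3, lam):
    """q_{n+1} = l3/lam + (1/lam) * integral of (1 - a)**n * g(a) over [0, 1]."""
    return l3 / lam + integrate(lambda a: (1 - a) ** n * g(a)) / lam


# Hypothetical failure rates, expressed as fractions of the total (lam = 1).
l1, l2, l3, lam = 0.3, 0.3, 0.4, 1.0

g2 = lambda a: l1                # Example #2
g3 = lambda a: 2 * l1 * a        # Example #3
g4 = lambda a: 2 * l1 * (1 - a)  # Example #4

for n in range(2, 5):
    print(n, round(p_n(g2, n, lam), 5), round(p_n(g3, n, lam), 5), round(p_n(g4, n, lam), 5))
```

The same helper functions accept any other integrable g(a), which makes it easy to explore densities that have no closed form.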

9.2.3 Comparison of Models 


It was stated previously that the correlation between the Urn Model and
the empirical distributions was more than a coincidence. We illustrate by
assuming that Example #4 depicts the actual processor with

$\frac{\lambda_1}{\lambda} = .3, \quad \frac{\lambda_2}{\lambda} = .3, \quad \frac{\lambda_3}{\lambda} = .4$.

In this case

$p_1$ = .4
$p_2$ = .05
$p_3$ = .03
$p_4$ = .02
14) $p_5$ = .0143
$p_6$ = .0107
$p_7$ = .0083
$p_8$ = .0066
$q_9$ = .46 .


Now let us fit the Urn Model to this distribution. We choose $\lambda_1/\lambda$,
$\lambda_2/\lambda$, $\lambda_3/\lambda$ and a such that the Urn Model agrees with $p_1$, $p_2$
and $q_9$. The result is easily shown to be

$\frac{\lambda_1}{\lambda} = .2238, \quad \frac{\lambda_2}{\lambda} = .3246, \quad \frac{\lambda_3}{\lambda} = .4516, \quad a = .337, \quad P_0 = .5484, \quad P = .73.$

If these values are substituted into (3) we obtain

$p_1$ = .4
$p_2$ = .05
$p_3$ = .033
$p_4$ = .022
15) $p_5$ = .0146
$p_6$ = .0097
$p_7$ = .0064
$p_8$ = .0042
$q_9$ = .46 .
Comparing (14) and (15) shows that the differences between the true and Urn
Model distributions are small. Clearly, it would require a very accurate
statistical analysis to distinguish between the two distributions. We note,
incidentally, that if the Urn Model had been used to estimate $\lambda_1/\lambda$ the
error would have been 25.4%.
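The fit quoted above can be reproduced with a few lines of code. The sketch below assumes the three matching conditions are $p_1$, $p_2$ and $q_9$ of the Example #4 distribution and solves them by bisection; the function name and the solution method are illustrative choices, not the procedure used in the study.

```python
def fit_urn(p1, p2, q9, n=8):
    """Solve for (l1, l2, l3, a), with the l's as fractions of lambda, such that
    the Urn Model of equations (3) reproduces p1, p2 and q9."""
    # Adding p1 and q9 and using l1 + l2 + l3 = 1 gives
    #   l1 * b * (1 - b**(n - 1)) = 1 - p1 - q9,   with b = 1 - a,
    # and p2 = a * b * l1, so that
    #   1 + b + ... + b**(n - 2) = (1 - p1 - q9) / p2.
    target = (1 - p1 - q9) / p2
    lo, hi = 0.0, 1.0
    for _ in range(200):  # bisection; the geometric sum increases with b
        mid = 0.5 * (lo + hi)
        if sum(mid ** j for j in range(n - 1)) < target:
            lo = mid
        else:
            hi = mid
    b = 0.5 * (lo + hi)
    a = 1 - b
    l1 = p2 / (a * b)
    l2 = p1 - a * l1
    l3 = 1 - l1 - l2
    return l1, l2, l3, a


l1, l2, l3, a = fit_urn(p1=0.4, p2=0.05, q9=0.46)
error_in_l1 = (0.3 - l1) / 0.3  # relative to the true value of Example #4
print(round(l1, 4), round(l2, 4), round(l3, 4), round(a, 3), round(100 * error_in_l1, 1))
```

Although the fitted distribution is close to the true one, the fitted $\lambda_1/\lambda$ is far from 0.3, which is the point of the comparison: a good fit of the distribution does not guarantee good estimates of the underlying parameters.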
10.0 STATISTICAL ANALYSES


10.1 Introduction 

As indicated previously, the principal objective of the study is to obtain
estimates of fault coverage and fault latency in a typical avionics miniprocessor.
Although the statistical experiments were carefully designed to yield high
accuracy and confidence for the least cost the estimates should not be taken too 
literally. The reader is advised to exercise engineering judgement in interpret- 
ing the results especially when inferring conclusions that depend upon small 
differences in the estimates. The reason for caution is the uncertainty in the 
assumptions underlying the study - assumptions which may, if incorrect or inac- 
curate, contribute a far greater uncertainty to the results than the statisti- 
cal analysis would imply. 

For the record, the critical assumptions of the study are: 

• From the standpoint of failure modes and effects every device can be
represented by the manufacturer-supplied, gate-level equivalent circuit.

• Every fault can be represented as either a S-a-0 or S-a-1 at a gate node. 

• The failure rate of each device is equally distributed over the gates of 
the gate-level equivalent circuit. 

• The failure rate of each gate is equally distributed over the nodes of 
the gate. 

• Memory failures are exclusively faults of single bits. 

The assumption that S-a-0 and S-a-1 faults are equally likely is not critical
since the experiments were conducted in such a way that the results can
easily be modified to reflect a change in this assumption.

Until additional data become available the effects of these assumptions on
the estimates cannot be properly assessed. In the interim it must be said that
the results only pertain to a conjectured realization of the processor.

The statistical experiments (i.e., the number and distribution of faults) 
are designed to extract as much information as practicable from each experiment 
for a given set of faults. Thus, in addition to the principal fault coverage 
and fault latency estimates, it was considered desirable to obtain these esti- 
mates for each of the six partitions of the processor. These would provide a 
basis for more detailed analysis of the fault detection process, and would iden- 
tify components according to their failure detection coverage. Such data, even 
if unused in the present study, would be available for future studies.
10.2 Estimators for Self-Test Coverage


The estimators for x, y and z are

1) $x^* = \frac{m_d}{m}$

2) $y^* = \frac{n_d}{n}$

3) $z^* = \frac{m_d + n_d}{m + n}$

where

x, y, z = probability that a S-a-0, S-a-1, combined fault is detected;
$m_d$, $n_d$ = number of S-a-0, S-a-1 faults detected;
m, n = number of S-a-0, S-a-1 faults injected.

A more accurate estimate of z can be obtained if stratified sampling is
employed. For example, let

$a_x$ = proportion of S-a-0 faults in the fault set of the processor

$a_y$ = proportion of S-a-1 faults in the fault set of the processor

where $a_x + a_y = 1$.

If m and n are selected such that

$m = a_x N, \quad n = a_y N$

where

N = total number of faults injected,

then

$z^* = a_x\,x^* + a_y\,y^*$

is more accurate than (3) if $x \ne y$. Although stratified sampling was not
intentionally employed in the study, the actual selection resulted in an almost
equal number of S-a-0 and S-a-1 faults. (*)

* In the selection process $a_x = a_y = 0.5$, i.e., S-a-0 and S-a-1 faults were
equally likely.
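The estimators (1) through (3) and the stratified form of z* can be sketched as below. The function name and the fault counts are hypothetical; the example deliberately uses unequal sample sizes so that the pooled and stratified estimates differ.

```python
def coverage_estimates(m_d, m, n_d, n, a_x=None):
    """Return (x*, y*, z*) from equations (1)-(3).

    If a_x (the proportion of S-a-0 faults in the processor's fault set) is
    given, z* is the weighted estimate a_x * x* + (1 - a_x) * y*; otherwise
    z* is the pooled estimate (m_d + n_d) / (m + n).
    """
    x = m_d / m
    y = n_d / n
    if a_x is None:
        z = (m_d + n_d) / (m + n)
    else:
        z = a_x * x + (1 - a_x) * y
    return x, y, z


# Hypothetical counts: 200 S-a-0 faults injected (180 detected),
# 300 S-a-1 faults injected (150 detected).
pooled = coverage_estimates(180, 200, 150, 300)
weighted = coverage_estimates(180, 200, 150, 300, a_x=0.5)
print(pooled, weighted)
```

When the samples are drawn in the proportions $a_x$ and $a_y$, the two forms of z* coincide; they differ only when the sample proportions depart from those of the fault set.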
10.3 Estimators for Latency


The estimators for $x_k$, $y_k$ and $z_k$ are

$x_k^* = \frac{m_k}{m}$

4) $y_k^* = \frac{n_k}{n}$

$z_k^* = \frac{m_k + n_k}{m + n}$, k = 1,2,3,...,8

where

$x_k$, $y_k$, $z_k$ = probability that a S-a-0, S-a-1, combined fault is
detected in the k-th repetition;

$m_k$, $n_k$ = number of S-a-0, S-a-1 faults detected in the k-th repetition.

With some abuse of terminology we define

$x_9$, $y_9$, $z_9$ = probability that a S-a-0, S-a-1, combined fault is not
detected in the previous 8 repetitions.

We note that $x_9$ corresponds to $q_9$ of Section 9. The estimators for $x_9$, $y_9$
and $z_9$ are

$x_9^* = \frac{m - m_1 - m_2 - \ldots - m_8}{m} = 1 - x_1^* - x_2^* - \ldots - x_8^*$

5) $y_9^* = \frac{n - n_1 - n_2 - \ldots - n_8}{n} = 1 - y_1^* - y_2^* - \ldots - y_8^*$

$z_9^* = \frac{m\,x_9^* + n\,y_9^*}{m + n} = 1 - z_1^* - z_2^* - \ldots - z_8^*$.
10.3.1 Corrections for Indistinguishable Faults


The occupancy probabilities $x_k$, $y_k$, $z_k$ assume that indistinguishable faults
have been disqualified. As indicated previously, the fault set will contain
a certain proportion of indistinguishable faults, $\gamma$. When such faults are
present the occupancy probabilities are $x'_k$, $y'_k$, $z'_k$ where

$x'_k = x_k\,(1 - \gamma)$

$y'_k = y_k\,(1 - \gamma)$

$z'_k = z_k\,(1 - \gamma)$, k = 1,2,...,8

$x'_9 = x_9\,(1 - \gamma) + \gamma$

$y'_9 = y_9\,(1 - \gamma) + \gamma$

$z'_9 = z_9\,(1 - \gamma) + \gamma$,

assuming that indistinguishable faults are uniformly distributed over S-a-0
and S-a-1 faults. Since indistinguishable faults were not disqualified in
the Phase I experiments the estimates actually obtained are those of
$x'_k$, $y'_k$ and $z'_k$.


10.4 Estimators for Urn Model Parameters 

The method of estimation will be described for S-a-0 latency distributions.
With an obvious change in parameters, the estimates can be applied to
S-a-1 and combined latency distributions as well.

The method is based on the principle of maximum likelihood. We note that
$m_k$ S-a-0 faults are detected in the k-th repetition. Accordingly, we seek Urn
Model parameters $a$, $P$ and $P_0$ that maximize the likelihood function

6) $L = p_1^{m_1}\,p_2^{m_2} \cdots p_8^{m_8}\,q_9^{m_9}$
where

$p_1 = P_0\,P$

$p_2 = (1 - P)\,a\,P_0$

$p_3 = (1 - P)\,a\,P_0\,(1 - a)$

7) ...

$p_8 = (1 - P)\,a\,P_0\,(1 - a)^6$

$q_9 = Q_0 + (1 - P)\,P_0\,(1 - a)^7$

and $m_9 = m - m_1 - m_2 - \ldots - m_8$

(See Section 9.1 for a definition of the Urn Model).

The maximum likelihood estimators for $a$, $P$ and $P_0$ are obtained as the
solution of

8) $\frac{\partial L}{\partial a} = \frac{\partial L}{\partial P} = \frac{\partial L}{\partial P_0} = 0$.

It can be shown that the solution to the $\partial L/\partial P_0 = 0$ equation is:



The solution to the $\partial L/\partial P = 0$ equation is:


9) $P^*$
Solving the $\partial L/\partial a = 0$ equation for the quantity $(1 - a^*)$ yields the
following equation:

$A\,(1 - a^*)^8 + B\,(1 - a^*)^7 + C\,(1 - a^*) + D = 0$

where

$A = -8 \sum_{i=1}^{8} m_i + \sum_{i=1}^{8} i\,m_i + 7\,m_1$

$B = 9 \sum_{i=1}^{8} m_i - \sum_{i=1}^{8} i\,m_i - 8\,m_1$

$C = \sum_{i=1}^{8} m_i - \sum_{i=1}^{8} i\,m_i$

$D = -2 \sum_{i=1}^{8} m_i + \sum_{i=1}^{8} i\,m_i + m_1$.


The roots of this equation are determined from a root solving routine, and
substituted into (8) and (9) to obtain $P_0^*$ and $P^*$.
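The estimation procedure can be sketched as follows. The polynomial coefficients are those given above. The closed forms used for $P^*$ and $P_0^*$ are not the report's printed expressions: they were re-derived for this sketch from the likelihood (6) with the cell probabilities (7), and should be read as an assumption. The bisection search stands in for the unspecified root solving routine, and the counts in the example are synthetic, generated from known parameters so that the estimates can be checked.

```python
def urn_mle(counts, m):
    """Maximum likelihood estimates (a*, P*, P0*) for the Urn Model.

    counts = [m_1, ..., m_8] detections per repetition; m = faults injected.
    """
    S = sum(counts)
    T = sum(i * mi for i, mi in enumerate(counts, start=1))
    m1 = counts[0]
    A = -8 * S + T + 7 * m1
    B = 9 * S - T - 8 * m1
    C = S - T
    D = -2 * S + T + m1

    def f(b):  # polynomial in b = 1 - a*
        return A * b ** 8 + B * b ** 7 + C * b + D

    # The coefficients sum to zero, so b = 1 is always a root; search below it.
    lo, hi = 0.0, 0.999
    if f(lo) * f(hi) > 0:
        raise ValueError("no sign change on (0, 0.999); use a general root finder")
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if f(lo) * f(mid) <= 0:
            hi = mid
        else:
            lo = mid
    b = 0.5 * (lo + hi)

    # Re-derived for this sketch from the stationarity conditions (assumption).
    P = m1 * (1 - b ** 7) / (S - m1 * b ** 7)
    P0 = m1 / (m * P)
    return 1 - b, P, P0


# Synthetic counts generated from known parameters (not experimental data).
true_a, true_P, true_P0, m = 0.3, 0.7, 0.6, 1000.0
counts = [m * true_P0 * true_P] + [
    m * (1 - true_P) * true_a * true_P0 * (1 - true_a) ** (k - 2) for k in range(2, 9)
]
a_hat, P_hat, P0_hat = urn_mle(counts, m)
print(round(a_hat, 6), round(P_hat, 6), round(P0_hat, 6))
```

With counts that follow the model exactly, the estimates return the generating parameters; with experimental counts the polynomial may have more than one root in (0, 1), in which case each candidate should be checked against the likelihood.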

10.5 Accuracy and Confidence of Coverage Estimates

It can be shown (ref. 2) that

10) $E(x^*) = x, \quad E(y^*) = y, \quad E(z^*) = z$

and

$E\left((x - x^*)^2\right) = \frac{x\,(1 - x)}{m}$

11) $E\left((y - y^*)^2\right) = \frac{y\,(1 - y)}{n}$

$E\left((z - z^*)^2\right) = \frac{z\,(1 - z)}{N}$

where

E(·) = expected value of (·).

For m, n and N sufficiently large the estimators x*, y* and z* are approximately
Gaussian with means and variances given by (10) and (11), respectively.

The following derivation of accuracy and confidence is general and applies
to any quantity, x, estimated by the method of Section 10.2. As before,

x* = estimate of x
m = sample size.

It is well-known (see (ref. 3), for example) that the probability
that x lies between the limits

$\frac{m}{m + \lambda^2} \left[ x^* + \frac{\lambda^2}{2m} \pm \lambda \sqrt{\frac{x^* (1 - x^*)}{m} + \frac{\lambda^2}{4 m^2}} \right]$

or, equivalently, that x* lies between the limits

12) $x \pm \lambda \sqrt{\frac{x\,(1 - x)}{m}}$

is equal to $\gamma$, where $\gamma$ is the area of the standard Gaussian distribution
between $-\lambda$ and $\lambda$. From (11) we may say that the error in the estimate, x*, is

13) $\epsilon = \lambda \sqrt{\frac{x\,(1 - x)}{m}}$

with a confidence level of $\gamma$.

Equation (13) is an ellipse in x. Table 22 gives a tabulation of
$\epsilon \sqrt{m}$ versus x for a confidence level of $\gamma$ = .95.

It is often convenient to obtain error estimates that are independent of x.
From (13) it can be seen that the maximum error occurs when x = 1/2. Table 23
gives a tabulation of this maximum error versus sample size and confidence
level. It is noted that the maximum error can be extremely conservative.
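Equation (13) and the two tables can be reproduced with the standard library. In the sketch below the function names are hypothetical, and the Gaussian multiplier is obtained from the inverse normal distribution function rather than from printed tables.

```python
from math import sqrt
from statistics import NormalDist


def gaussian_multiplier(confidence):
    """Value of lambda such that the standard Gaussian area between
    -lambda and +lambda equals the given confidence level."""
    return NormalDist().inv_cdf(0.5 + confidence / 2.0)


def estimate_error(x, m, confidence=0.95):
    """Error of equation (13): lambda * sqrt(x * (1 - x) / m)."""
    return gaussian_multiplier(confidence) * sqrt(x * (1.0 - x) / m)


def max_error(m, confidence=0.95):
    """Worst case of equation (13), attained at x = 1/2."""
    return estimate_error(0.5, m, confidence)


# With m = 1 the error equals epsilon * sqrt(m), the quantity tabulated in Table 22.
print(round(estimate_error(0.05, 1), 3), round(estimate_error(0.5, 1), 3))
# Maximum error for a sample of 200 at the 95% level, as tabulated in Table 23.
print(round(max_error(200), 3))
```

For a coverage near 0.95 the error from (13) is less than half of the x = 1/2 bound, which illustrates how conservative the maximum error can be.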
10.6 Accuracy and Confidence of Latency Estimates


In this section we derive accuracy and confidence levels for S-a-0 latency
estimates. Again, the results are easily extrapolated to S-a-1 and combined
estimates.

It is shown (ref. 3, pg. 214) that

14) $E(x_k^*) = x_k$

15) $E\left((x_k - x_k^*)^2\right) = \frac{x_k\,(1 - x_k)}{m}$

16) $E\left((x_i - x_i^*)\,(x_j - x_j^*)\right) = -\frac{x_i\,x_j}{m}, \quad i \ne j$

for i, j, k = 1, 2, 3, ..., 9.

From (14) and (15) it can be seen that the accuracy and confidence level of
a single estimate, $x_k^*$, is identical to that obtained for the coverage estimate
x* in the previous section.

To obtain a measure of "goodness of fit" for the entire distribution we
observe that, for m sufficiently large, the variable

17) $\chi^2 = m \sum_{k=1}^{9} \frac{(x_k - x_k^*)^2}{x_k}$

is distributed in a chi-square distribution with 8 degrees-of-freedom (see
ref. 3, page 419).

If $\chi^2_{1-\gamma}$ denotes the $1 - \gamma$ level of $\chi^2$, i.e.,

$\Pr\left(\chi^2 \le \chi^2_{1-\gamma}\right) = \gamma$

then the probability that a point, $(x_1^*, x_2^*, \ldots, x_9^*)$, lies inside the
ellipsoid

18) $m \sum_{k=1}^{9} \frac{(x_k - x_k^*)^2}{x_k} \le \chi^2_{1-\gamma}$

is equal to $\gamma$.

In principle, we can obtain error bounds for the estimates, $x_k^*$, from (18).
There are two reasons why we do not do so:

1) A chi-square fit generally requires that $m\,x_k \ge 10$ for all $x_k$, a
condition that is not satisfied for the latency cells k = 4,5,6,7,8.

2) The Phase I experiments indicate that the latency distributions are
concentrated at the 1st and 9th cells, the other cells contributing
less than 10% to the total. Thus, the occupancy probabilities of the
1st and 9th cells are the most significant.

If we group the intermediate cells into a single cell and denote the
occupancy probability by p and its estimate by p* then, for m sufficiently large,
the variable

19) $\chi^2 = \frac{m}{x_1}\,(x_1 - x_1^*)^2 + \frac{m}{x_9}\,(x_9 - x_9^*)^2 + \frac{m}{p}\,(p - p^*)^2$

is chi-square distributed with 2 degrees-of-freedom. We note that

$p = 1 - x_1 - x_9$

$p^* = 1 - x_1^* - x_9^*$.

Equation (19) represents a skewed ellipse in the plane of $x_1^*$, $x_9^*$.

We can simplify the error estimates by observing that the ellipse of
(19) lies inside of the ellipse

20) $\frac{m}{x_1}\,(x_1 - x_1^*)^2 + \frac{m}{x_9}\,(x_9 - x_9^*)^2 \le \chi^2_{1-\gamma}$.

From (20) we conclude that the errors $(x_1 - x_1^*)$, $(x_9 - x_9^*)$
simultaneously lie on the intervals

21) $-\epsilon_k < (x_k - x_k^*) < \epsilon_k$

where

22) $\epsilon_k = \sqrt{\frac{x_k\,\chi^2_{1-\gamma}}{m}}$, k = 1,9

with a probability not less than $\gamma$.

It is interesting to compare these errors with those obtained if the $x_k$ were
tested independently. In this case the errors are

23) $\epsilon_k = \lambda \sqrt{\frac{x_k\,(1 - x_k)}{m}}$, k = 1,9.

If we select $\gamma$ = .95 (95% level) then

$\lambda = 1.96, \quad \chi^2_{.05} = 5.99$

and (22), (23) reduce to

24) $\epsilon_k = 2.45 \sqrt{\frac{x_k}{m}}, \qquad \epsilon_k = 1.96 \sqrt{\frac{x_k\,(1 - x_k)}{m}}$, respectively.

10.7 Accuracy and Confidence of Urn Model Parameter Estimates


Let $p_i$ denote the probability that a fault is detected in the i-th
repetition, i = 1, 2, ..., 8 and $q_9$ the probability that a fault is not detected in
the previous 8 repetitions. Then the multinomial sampling distribution is

25) $f = p_1^{u_1}\,p_2^{u_2} \cdots p_8^{u_8}\,q_9^{u_9}$

where the $p_i$ and $q_9$ are defined as in Section 10.4 and the points
$(u_1, u_2, \ldots, u_9)$ are taken from the set

(1,0,...,0)

(0,1,0,...,0)

...

(0,0,...,0,1).

We note that

$\sum_{i=1}^{8} p_i + q_9 = 1$.

In order to obtain error bounds for the Urn Model parameter estimates
we invoke a theorem from (ref. 2), page 212:

Theorem

The maximum likelihood estimators $\theta_1^*$, $\theta_2^*$, $\theta_3^*$ for the sampling
distribution $f(u, \theta_1, \theta_2, \theta_3)$ from samples of size n are, for large samples,
approximately distributed by the multivariate Gaussian distribution with means
$\theta_1$, $\theta_2$, $\theta_3$ and variances and covariances, $\sigma_{ij}$, where $\|\sigma_{ij}\|$ is the
inverse of the matrix whose elements are

26) $G_{ij} = -n\,E\left[\frac{\partial^2 \log f}{\partial \theta_i\,\partial \theta_j}\right]$, i, j = 1,2,3.

When applied to the Urn Model

$\theta_1 = P$

$\theta_2 = P_0$

$\theta_3 = a$

and $\theta_1^*$, $\theta_2^*$, $\theta_3^*$ are the estimated values of $\theta_1$, $\theta_2$ and $\theta_3$, respectively.
The elements of the covariance matrix $\|\sigma_{ij}\|$ were tabulated for the Phase I
experiments and the results are given in Table 16.

A much simplified estimate of the errors can be obtained by employing an
approximation that was suggested in (ref. 1). There, it was assumed that

$q_9 = Q_0$.

In other words, detectable faults are always detected in the first 8
repetitions. From (7) this is equivalent to the approximation

27) $(1 - P)\,P_0\,(1 - a)^7 = 0$.

If this substitution is made in the likelihood function, L, then the resultant
estimates are, for S-a-0 faults,

$P_0^* = \frac{1}{m} \sum_{i=1}^{8} m_i$

28) $P^* = \frac{m_1}{\sum_{i=1}^{8} m_i}$

$a^* = \frac{\sum_{i=1}^{8} m_i - m_1}{\sum_{i=1}^{8} i\,m_i - \sum_{i=1}^{8} m_i}$.

The experimental results confirm the accuracy of these approximations*
(see Table 15). More interesting, however, are the resultant error covariances.
When the approximation of (27) is made we obtain


* At least for the distributions obtained in the study.
$E\left((P - P^*)^2\right) = \frac{P\,(1 - P)}{m\,P_0}$

29) $E\left((P_0 - P_0^*)^2\right) = \frac{P_0\,(1 - P_0)}{m}$

$E\left((a - a^*)^2\right) = \frac{a^2\,(1 - a)}{m\,P_0\,(1 - P)}$

and the cross-covariances vanish. Thus the estimates are independent and, at
a confidence level of $\gamma$, the errors are, for $P$, $P_0$, $a$, respectively,

$\epsilon_P = \lambda \sqrt{\frac{P\,(1 - P)}{m\,P_0}}$

$\epsilon_{P_0} = \lambda \sqrt{\frac{P_0\,(1 - P_0)}{m}}$

$\epsilon_a = \lambda\,a \sqrt{\frac{1 - a}{m\,P_0\,(1 - P)}}$

where $\lambda$ is as defined in Section 10.5. As an indication of the error magnitudes,
for a typical fit (see FETSTO results, Table 15),

$P \approx .781$
$P_0 \approx .383$
$a \approx .464$

which result in the following errors at a 95% confidence level:

$\epsilon_P$ = .041 (5.2%)
$\epsilon_{P_0}$ = .030 (7.8%)
$\epsilon_a$ = .073 (15.7%).

The reader is reminded that these estimates are only valid if the Urn Model
correctly represents the distribution. The example at the end of Section 9
illustrates the uncertainty in estimated variables when an incorrect model is
used.
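The simplified estimates (28) and the error formulas derived from (29) are easy to evaluate. In the sketch below the function names and the detection counts are hypothetical. The sample size m = 1000 is an assumption: it is not restated in this passage, but it reproduces the errors quoted above for the FETSTO fit.

```python
from math import sqrt


def approx_urn_estimates(counts, m):
    """Equations (28): estimates of (P, P0, a) under the approximation (27).

    counts = [m_1, ..., m_8] detections per repetition; m = faults injected.
    """
    S = sum(counts)
    T = sum(i * mi for i, mi in enumerate(counts, start=1))
    P0 = S / m
    P = counts[0] / S
    a = (S - counts[0]) / (T - S)
    return P, P0, a


def urn_errors(P, P0, a, m, lam=1.96):
    """Error half-widths from the variances of (29); lam = 1.96 for the 95% level."""
    eps_P = lam * sqrt(P * (1 - P) / (m * P0))
    eps_P0 = lam * sqrt(P0 * (1 - P0) / m)
    eps_a = lam * a * sqrt((1 - a) / (m * P0 * (1 - P)))
    return eps_P, eps_P0, eps_a


# Parameter values quoted in the text; m = 1000 is assumed (see lead-in).
eps_P, eps_P0, eps_a = urn_errors(0.781, 0.383, 0.464, m=1000)
print(round(eps_P, 3), round(eps_P0, 3), round(eps_a, 3))

# Hypothetical detection counts, to exercise equations (28).
P_hat, P0_hat, a_hat = approx_urn_estimates([400, 50, 30, 20, 14, 11, 8, 7], m=1000)
```

The error in a is the largest of the three because it is estimated only from the faults detected after the first repetition, a small fraction of the sample.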
TABLE 22

Error Ellipse for a Confidence Level of $\gamma$ = .95

$\epsilon \sqrt{m} = \lambda \sqrt{x\,(1 - x)}$

   x        $\epsilon \sqrt{m}$

   0         0.0
  .05        .427
  .1         .588
  .15        .70
  .2         .784
  .25        .849
  .3         .898
  .35        .935
  .4         .960
  .45        .975
  .5         .98
  .55        .975
  .6         .96
  .65        .935
  .7         .898
  .75        .849
  .8         .784
  .85        .7
  .9         .588
  .95        .427
 1.0         0.0
TABLE 23

                          SAMPLE SIZE
CONFIDENCE
LEVEL          200      300      400      600      1000

  .6          .03      .025     .021     .017     .013
  .7          .037     .03      .026     .021     .017
  .8          .046     .038     .033     .027     .021
  .9          .058     .048     .041     .034     .026
  .95         .069     .056     .049     .04      .031


11.0 EMULATION DESCRIPTION


11.1 BDX-930 Architecture

The BDX-930 Digital Processor is a microprogrammed, pipelined machine de- 
signed around the AMD2901A four bit microprocessor slice. The machine contains 
sixteen general purpose registers of which four registers may be loaded direct- 
ly from memory and two registers may be used as base registers. One register is 
used as a stack pointer. 

The program counter and memory address register are contained in the 9407, 
a chip designed to perform memory address arithmetic. Along with a temporary 
register contained on the same chip, the BDX-930 is able to perform four basic 
addressing modes involving three registers and various instruction fields. 

The machine contains three memory interface data registers which are used 
to input and output memory data. There are also a number of one bit status flag 
registers that can be manipulated under program control. This includes the F1
and F2 registers, which are hardware flags, and the interrupt enable, overflow
status registers. There also exist the indirect and link registers used by the 
microcode for branching. 

The microcode is contained in seven proms and a pipeline register is in- 
cluded for simultaneous microcode fetch and decoding. Various internal and 
external conditions can affect microcode branching as selected by the microcode 
itself and a microcode control prom. In addition to a rich instruction set which 
includes 16 and 32 bit fixed point operations, there is a test set interface in 
the microcode. A selectable saturate mode is available which limits the results 
of arithmetic operations when overflow or underflow occur. 

For simulation purposes, the computer has been divided into six partitions, 
consisting of the following principal devices: 

Partition 1 - Address Processor

• 4 - 9407 Memory Address Processor Equivalent Circuit

• Selector Chips to Multiplex Memory Address Source

    • 4 - 54LS352 4:1

    • 2 - 54LS158 2:1

Partition 2 - Data and Status Registers

• 2 - 54LS374 Memory Input Buffer Register

• 2 - 54LS374 Memory Output Buffer Register

• 2 - 54LS374 Next Instruction Register

• 3 - 54LS113 Single Bit Registers for

    • overflow

    • indirect addressing

    • link (bit carry for divide)

    • interrupt mode

    • F1 and F2

• 2 - 54LS153 Select Overflow, Link, and Indirect Bit Sources

• 2 - 54LS245 octal bus transceivers

Partition 3 - Microcontroller

• Pipeline Register

    • 4 - 54LS273 octal latch

    • 4 - 54LS175 quad latch

    • 1 - 54LS374 octal latch with tri-state

• 1 - 54LS273 External Signal Synchronizer

• 3 - 54LS151 Selectors 8:1 for Branch Conditions

• 1 - 54LS169 Counter for Shift and Multiply Instructions

• 1 - 54LS169 Counter for Multiple Register Load-Store Instructions

• 1 - 54LS377 Instruction Register

• 1 - 54LS253 Microcode Branch Selector

Partition 4 - Execute

• 4 - AMD2901A 4 Bit Slice ALU

• 1 - AMD2902 Lookahead Carry

• 2 - 54LS153 Selector 4:1 Register Selectors

• 1 - 54LS253 Selector 4:1 Shift Bit Selector

Partition 5 - Microcode

• 7 - 54S472 Proms with 56 Bit Wide Microcode

Partition 6 - Control Proms

• 1 - 54S472 Prom Microcode Start Address for Macroinstructions

• 1 - 54S288 Prom Control for Microcode Branch

Instruction execution is accomplished by a pipelined architecture; various
stages of execution occur simultaneously for a sequence of instructions.

Consider, for instance, four instructions, A, B, C and D, to be executed in
sequence. During the same clock cycle it is possible for the program counter to
be incremented to point to instruction D, while instruction C is being fetched,
instruction B is being decoded and instruction A is being executed.

With this level of parallelism, when the execution phase of an instruction
takes one clock cycle, the average time to perform the entire instruction is
also one clock cycle. This relation can better be understood by referring to
Table 24.
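
The overlap described above can be illustrated with a minimal sketch. It assumes an ideal, stall-free pipeline in which every stage takes exactly one clock cycle; the stage names follow the text, while the function name `pipeline_schedule` is hypothetical and not part of the emulator.

```python
# Stage names follow the text; an ideal pipeline with no stalls is assumed.
STAGES = ["address", "fetch", "decode", "execute"]

def pipeline_schedule(instructions):
    """Return {cycle: {stage: instruction}} for a stall-free pipeline."""
    schedule = {}
    for i, instr in enumerate(instructions):
        for s, stage in enumerate(STAGES):
            # Instruction i enters stage s during clock cycle i + s.
            schedule.setdefault(i + s, {})[stage] = instr
    return schedule

sched = pipeline_schedule(["A", "B", "C", "D"])
for cycle in sorted(sched):
    print(cycle, sched[cycle])
```

In cycle 3 of this sketch the program counter addresses D while C is fetched, B is decoded and A is executed. Once the pipe is full, one instruction completes per clock cycle.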

It should also be noted that the partitioning of the BDX-930 roughly
follows the stages of the pipe: address, fetch, decode, and execute. These
stages are joined by various buses throughout the CPU. The buses are formed
from tri-state logic and some are bidirectional. The major buses are:

• Y - Connects the output of the ALU (AMD2901A) to the address processor
and the output register. In addition, it connects the output of the next
instruction buffer to the start address register and instruction register.

• D - Connects the memory data register and the program counter to the
input of the ALU.

• DAT - Bidirectional bus connecting memory and I/O to the memory data
register and output register.

• M - Bidirectional memory data bus

• MAR - Memory Address Bus

• U - Microcode Bus

• IR - Instruction Register

A list of the devices used in the BDX-930 and their failure rates is given
in Table 25. The data were obtained from MIL-HDBK-217B, Notice 2.

11.2 Description of the Emulator 

The emulation includes the components of the CPU (Central Processor Unit),
scratchpad memory and those portions of the program memory containing the six
target programs and the target self-test program. The emulation is derived
from the circuit schematics. Each device is represented by a gate-level
equivalent circuit supplied by the chip manufacturer. Six types of gates were
found sufficient to represent any device: NAND, AND, OR, NOT, NOR and
EXCLUSIVE OR. Table 26 gives the number of equivalent gates in each device of
the CPU. In all, 5,100 gates were required. In the interest of reducing
execution time, it was not expedient to emulate all components at the gate
level.

The following elements are represented at the functional level:

program memory

scratchpad memory

microprogram and control memories

16 general purpose arithmetic registers.

The emulation did not include the direct memory access unit (DMA) or any of
the devices of the I/O. The emulated devices of the CPU are shown in Figure 13.

Faults were injected into all devices except the program and scratchpad 
memories. Because the program memory is "read-only", no processor, faulted or 
not, is permitted to write into this memory. However, even though the scratch- 
pad memory is never faulted, a faulty processor can write into it. As a conse- 
quence, in the parallel mode of operation where 36 processors are simultaneously 
emulated, the corresponding 36 scratchpad memories are also emulated. 

No delay has been simulated between logic gates. It is assumed that all
combinational logic is stable at the output the instant an input pattern is
applied to it. Each time the input changes, the network therefore need only be
evaluated once to supply the correct output pattern. Operating in this manner
is very time efficient, but places stringent requirements on the order of
evaluation of the gates. To meet these requirements, the logic is levelized,
i.e., placed in groups or levels that represent the proper order of
evaluation.
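
Levelization can be sketched as follows. This is an illustrative reconstruction, not the emulator's BLISS code: it assumes a feedback-free network held in a dictionary, and the names `levelize`, `evaluate` and the example netlist are hypothetical.

```python
def levelize(gates):
    """gates: {output_node: (op, [input_nodes])}, assumed free of feedback.
    Returns the gate outputs grouped by level, lowest level first."""
    level = {}

    def depth(node):
        if node not in gates:          # primary input: level 0
            return 0
        if node not in level:
            level[node] = 1 + max(depth(i) for i in gates[node][1])
        return level[node]

    for node in gates:
        depth(node)
    grouped = {}
    for node, lv in level.items():
        grouped.setdefault(lv, []).append(node)
    return [sorted(grouped[lv]) for lv in sorted(grouped)]

OPS = {"AND": lambda a, b: a & b,
       "OR": lambda a, b: a | b,
       "NOT": lambda a: a ^ 1}

def evaluate(gates, inputs):
    """Evaluate every gate exactly once, in level order."""
    values = dict(inputs)
    for group in levelize(gates):
        for node in group:
            op, ins = gates[node]
            values[node] = OPS[op](*(values[i] for i in ins))
    return values

# Hypothetical three-gate network used only for illustration.
netlist = {"n1": ("AND", ["a", "b"]),
           "n2": ("NOT", ["b"]),
           "out": ("OR", ["n1", "n2"])}
levels = levelize(netlist)
result = evaluate(netlist, {"a": 1, "b": 0})
```

Because each gate is evaluated only after all of its inputs, a single pass per input change suffices, which is the property the zero-delay assumption relies on.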

The emulator utilizes the parallel method of logic simulation (see, for 
instance, Seshu and Freeman (ref. 5), or Hardie and Suhocki (ref. 6)). The data 
word of a PDP-10 contains 36 bits; each bit position is used to represent a dif- 
ferent machine. The simplest gate operations are represented by a single Boolean 
instruction; when the two inputs occupy the same bit positions in their respec- 
tive words, the output also occupies this bit position. The advantage of this 
technique is execution time savings. Typically, the amount of code necessary 
to simulate 36 machines is of the same order as the amount of code necessary to 
simulate only one machine. The BDX-930 description is contained in compiled 
code, rather than in tables, which was also done for speed. 
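
The parallel method can be sketched in a few lines. The sketch below assumes Python integers masked to 36 bits as a stand-in for PDP-10 words; the helper names (`p_and`, `p_nand`, `broadcast`) are hypothetical.

```python
WORD_BITS = 36
MASK = (1 << WORD_BITS) - 1   # Python integers are unbounded, so mask to 36 bits

def p_and(a, b):
    return a & b              # one operation evaluates the gate in all 36 machines

def p_or(a, b):
    return a | b

def p_not(a):
    return ~a & MASK

def p_nand(a, b):
    return ~(a & b) & MASK

def broadcast(bit):
    """Present the same logic value to every simulated machine."""
    return MASK if bit else 0

a = broadcast(1)              # input a is 1 in all machines
b = 0b101                     # input b is 1 only in machines 0 and 2
and_out = p_and(a, b)
nand_out = p_nand(a, b)
```

Bit position k of every word belongs to machine k, so the cost of a gate evaluation does not depend on how many of the 36 machines are in use.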

Certain portions of the machine, notably the memory elements, were repre- 
sented at a functional level rather than a gate level. For microprogram memory, 
two words of PDP-10 storage contain 56 bits of microstore; at micro memory fetch 
time, these bits are retrieved from the proper address for each of the simulated 
machines and combined to form suitable words to interface the gate portion of 
the emulation. The ROM portion of main memory is handled in the same manner. 
Writable store contains a routine to translate the gate inputs into consecutive 
PDP-10 storage words so that there is one copy of writable storage for each 
machine being emulated. On reading this storage, the process is reversed. 

In a typical run of the emulator, 36 different machines are exercised; 35 
faulted machines and one good machine. Each faulted machine is assumed to have 
a single solid fault at one node, either stuck-at-one (S-a-1) or stuck-at-zero 
(S-a-0). The faults are injected by defining extra gates at each node, an AND 
gate for stuck at zero and an OR gate for stuck at one. A typical AND gate 
using this technique is shown in Figure 14. 

To demonstrate the use of this technique for injecting and emulating the
propagation of gate-level faults, refer to Figure 15. The figure shows a
typical gate-level operation involving four gates. The logic is levelized into
two levels. Assume that the input has the value '10' on it, and that we would
like to simulate 6 faults (S-a-0 at leads 3, 4 and 6 and S-a-1 at leads 3, 4
and 5) in the circuit. The first step is to define two PDP-10 computer words
to represent the inputs at each lead. Bit position 0 represents the unfaulted
machine while positions 1-6 represent the faulted ones. Next, fault words are
defined for S-a-1 and S-a-0 faults at each node and each node is assigned a
word to contain the results of its operation.

First, the input faults are applied to the input words yielding words (1) and 
(2). Then, the two buffer operations are performed. Buffer output faults are 
applied yielding words (3) and (4). Note that (3) and (4) occupy the same phy- 
sical storage as (1) and (2) yielding a memory efficient algorithm. The second 
level is evaluated in much the same manner, yielding the results (5) and (6). 
Table 27 shows the value of the nodes for Figure 15. 
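
The procedure can be traced with a short sketch. The circuit topology is taken from the node definitions listed with Table 27 (3 = 1 AND 1, 4 = 2 AND 2, 5 = 3 AND 4, 6 = NOT 4). The assignment of faults to bit positions 1-6 is an assumption made here for illustration, as are the names `FAULTS` and `inject`; 7-bit words are used instead of 36-bit words for brevity.

```python
N = 7                          # bit 0 = good machine, bits 1-6 = faulted machines
MASK = (1 << N) - 1

# (lead, stuck value) for bit positions 1..6; this ordering is assumed.
FAULTS = [(3, 0), (4, 0), (6, 0), (3, 1), (4, 1), (5, 1)]

def inject(node, word):
    """Apply the stuck-at faults defined at this node to the node's word."""
    for pos, (lead, stuck) in enumerate(FAULTS, start=1):
        if lead == node:
            # S-a-1 acts like an OR with the fault bit, S-a-0 like an AND with its complement.
            word = word | (1 << pos) if stuck else word & ~(1 << pos)
    return word & MASK

w = {1: MASK, 2: 0}            # input '10': lead 1 = 1, lead 2 = 0 in every machine
w[3] = inject(3, w[1] & w[1])  # level 1: buffers
w[4] = inject(4, w[2] & w[2])
w[5] = inject(5, w[3] & w[4])  # level 2
w[6] = inject(6, ~w[4])

# Machines whose outputs (leads 5 and 6) differ from the good machine in bit 0.
visible = {pos for pos in range(1, N)
           if any(((w[n] >> pos) & 1) != (w[n] & 1) for n in (5, 6))}
```

Under these assumptions only the faults in positions 3, 5 and 6 (S-a-0 at lead 6, S-a-1 at lead 4 and S-a-1 at lead 5) reach an output for the input '10'; the other three remain latent for this input pattern.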

An additional reduction in run-time can be achieved by observing that not
all gate faults are distinguishable at the gate output. For example, a S-a-0
fault on an input node of an AND gate is indistinguishable from a S-a-0 fault
on the output node, since either fault forces the output to zero. As a
consequence, if two or more indistinguishable faults of the same gate are
selected, only one fault will be emulated.
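
Indistinguishable faults of a single gate can be found by exhaustive enumeration. The sketch below does this for a two-input AND gate; the function names are hypothetical and the approach is an illustration, not the method used in the preprocessor.

```python
from itertools import product

def and_gate(a, b, fault=None):
    """Two-input AND with an optional stuck-at fault (node, value),
    where node is 'a', 'b' or 'out'."""
    if fault and fault[0] == "a":
        a = fault[1]
    if fault and fault[0] == "b":
        b = fault[1]
    out = a & b
    if fault and fault[0] == "out":
        out = fault[1]
    return out

def equivalence_classes(gate):
    """Group single stuck-at faults that give identical outputs for all inputs."""
    classes = {}
    for fault in product(("a", "b", "out"), (0, 1)):
        signature = tuple(gate(a, b, fault) for a, b in product((0, 1), repeat=2))
        classes.setdefault(signature, []).append(fault)
    return list(classes.values())

classes = equivalence_classes(and_gate)
for group in classes:
    print(group)
```

For the AND gate the six single faults collapse into four classes: the three S-a-0 faults are mutually indistinguishable, while each S-a-1 fault is distinct.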

It will be noted that only one partition of the BDX-930 runs with faults
injected in each simulated run. The remaining partitions run 'true value', that
is, logic without fault injection capabilities. This results in a time saving
in program execution. When the entire emulator is run true-value, the execution
ratio between PDP-10 time and simulated time is 21,000:1; with faults injected
in one partition, the ratio is approximately 25,000:1. In order to achieve
these ratios, a number of problems had to be solved.

Stabilization


The propagation of logic signals through a combinational logic network in- 
volves many concurrent paths of travel; the value at the output of any particu- 
lar gate is only stable after a certain interval. The inherently sequential ex- 
ecution of a computer program presents a potential problem as to the order of 
evaluation of gates. One approach to parallel operation in a sequential emulation 
is, during each BDX-930 clock time, to evaluate the gates repeatedly until re- 
evaluation produces no further change in state. It is desired to minimize pro- 
gram execution time; therefore the number of times each logic gate is evaluated 
should be minimized. If a particular sub-circuit is free of memory elements and 
feedback paths, it need be evaluated only once. The order of analysis here is 
critical, but not necessarily unique. Feedback elements represent a special 
problem. For a simple R-S type flip-flop, the proper output states can be
ascertained by evaluating each element at most twice.
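
The two-pass bound can be checked with a small sketch. It assumes the R-S flip-flop is built from two cross-coupled NOR gates evaluated in a fixed order; this construction and the name `settle_rs` are assumptions for illustration.

```python
def nor(a, b):
    return 1 - (a | b)

def settle_rs(s, r, q, qbar, max_passes=4):
    """Re-evaluate a cross-coupled NOR latch in a fixed order until stable.
    Returns (q, qbar, number of passes that changed the state)."""
    changed_passes = 0
    for _ in range(max_passes):
        new_q = nor(r, qbar)
        new_qbar = nor(s, new_q)       # uses the value just computed
        if (new_q, new_qbar) == (q, qbar):
            break
        q, qbar = new_q, new_qbar
        changed_passes += 1
    return q, qbar, changed_passes

set_result = settle_rs(s=1, r=0, q=0, qbar=1)    # set the latch from the reset state
worst_case = max(settle_rs(s, r, q, qb)[2]
                 for s in (0, 1) for r in (0, 1)
                 for q in (0, 1) for qb in (0, 1))
```

In this model no combination of inputs and initial state needs more than two state-changing passes, in agreement with the statement above.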

D-Latches 


The edge-triggered D latches are much harder to model. The circuit diagram
of such a latch is shown in Figure 16. Operation of these circuits depends upon
receiving both the low and high levels of the clock signal to trigger the
latch. All of the combinational logic in the BDX-930 requires only one
evaluation per clock cycle. In the interest of reducing execution time, the
D latch in Figure 16a was replaced by the latch in Figure 16b, which is
functionally equivalent but requires only one evaluation per clock cycle.

Tri-State Buses 


In order to evaluate tri-state buses, it is necessary to replace them with
a gate equivalent circuit. Such an equivalent circuit is easy to synthesize in 
the case of a wired-OR type circuit, but tri-state logic also may fail to a high 
impedance state. In this case, any change on the line is particularly sluggish. 
The last failure-free output on a tri-state line will exponentially approach the 
high state as a function of wiring capacitance and other circuit parameters at 
the time of the failure. This failure mode was not simulated; tri-state failure 
modes are considered the same as wired-OR failure modes. The justification is 
to avoid a failure mode that is random, and dependent on the past history of the 
gate. The equivalent circuit is shown in Figure 17. 

Serial-to-Parallel/Parallel-to-Serial


In a gate-level emulation each node is represented by a single word of the 
host computer whereas, in a register-level emulation, a collection of nodes is so 
represented. For example, a set of 16 nodes might represent the address bits of 
memory data which, at the register-level, would be represented by a single, 16 
bit word in the host computer. In an emulation which contains both gate-level 
and register-level components, it is necessary, in passing from one level to
another, to convert the host computer words to compatible formats. 

In the present emulation the following memory devices are represented at 
the register-level: 

microprogram (7 - 54S472, 512 x 8 proms)

microprogram control (54S288, 32 x 8 prom)

macroinstruction start address (54S472, 512 x 8 prom)

main memory

The first 3 components are read-only proms which require a conversion of 
the address nodes to register-level and data to gate-level. The main memory is 
both ROM and RAM and this requires a conversion of both the address and data 
nodes to register-level in the write cycle and address nodes to register-level 
and data to gate-level in the read cycle. 

For each of the above elements, two conversions are required for each pass. 
Because of the quantity of such elements and the frequency of passage it is 
essential to implement a real-time efficient conversion algorithm. 

We assume a parallel fault emulation with each node represented by a 36-bit 
word of the host computer. 
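
A collection of m nodes is represented by the set of words

\[
\begin{aligned}
A_1 &= (a_{1,1},\ a_{1,2},\ \ldots,\ a_{1,36}) \\
A_2 &= (a_{2,1},\ a_{2,2},\ \ldots,\ a_{2,36}) \\
    &\ \ \vdots \\
A_m &= (a_{m,1},\ a_{m,2},\ \ldots,\ a_{m,36})
\end{aligned}
\]

Assume that the corresponding register-level data are represented by the words

\[
\begin{aligned}
B_1    &= (a_{1,1},\ a_{2,1},\ \ldots,\ a_{m,1},\ x,\ \ldots,\ x) \\
B_2    &= (a_{1,2},\ a_{2,2},\ \ldots,\ a_{m,2},\ x,\ \ldots,\ x) \\
       &\ \ \vdots \\
B_{36} &= (a_{1,36},\ a_{2,36},\ \ldots,\ a_{m,36},\ x,\ \ldots,\ x)
\end{aligned}
\]

where the x's represent unused bits, with $m \le 36$.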

The resultant conversion algorithms are shown in Figures 18a, 18b. 
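
The conversion amounts to transposing a bit matrix. The sketch below is a straightforward double loop written for clarity; it is not the algorithm of Figures 18a and 18b, and the bit ordering (least significant bit first) and function names are assumptions.

```python
MACHINES = 36

def gate_to_register(node_words):
    """node_words[i] holds node i for all machines (bit j = machine j).
    Returns one register-level word per machine (bit i = node i)."""
    regs = [0] * MACHINES
    for i, word in enumerate(node_words):
        for j in range(MACHINES):
            regs[j] |= ((word >> j) & 1) << i
    return regs

def register_to_gate(regs, m):
    """Inverse conversion: spread m-bit register words back over m node words."""
    nodes = [0] * m
    for j, reg in enumerate(regs):
        for i in range(m):
            nodes[i] |= ((reg >> i) & 1) << j
    return nodes

# 16 address nodes; machine 1 differs from the others in address bit 0.
addresses = [0x1234] * MACHINES
addresses[1] = 0x1235
node_words = register_to_gate(addresses, 16)
round_trip = gate_to_register(node_words)
```

Each faulted machine thus obtains its own memory address even though the address lines are stored one node per host word.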


11.3 Preprocessor and Postprocessor 

The host processor for this emulation is a Digital Equipment Corporation
model PDP-10. This is a 36 bit machine built around 1970 taking about 1.5 to 3
microseconds for an integer add instruction, depending on addressing mode. The
processor is located at Carnegie-Mellon University in Pittsburgh, Pennsylvania
and was accessed via telephone lines. This machine supports time sharing for
university and research projects in which the university participates.


The emulation was written in the BLISS language. BLISS was developed at
Carnegie-Mellon University for system programming and later adopted by DEC,
and is well suited for our applications. It allows bit manipulation in a very
efficient manner while maintaining many of the structures of a higher order
language. The macro facility proved invaluable in developing an efficient
simulation.

The emulation process consists of running three separate programs: a
preprocessor, the emulation proper, and a postprocessor. The function of the
preprocessor is to select a random set of faults for emulation, while the
postprocessor interprets and prints the results.

The preprocessor reads in a list of components in the BDX-930 and their 
failure rates. The program then queries for the number of faults to be run, and 
faults are selected in the manner described in Section 3. Faults are broken up 
into sets for each partition, and further into groups of 35 or less for each 
emulator run. All pertinent information is written out onto disk for processing 
by the emulation phase. 
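
The grouping step can be sketched as follows. The selection rule of Section 3 is not reproduced here; the sketch assumes, purely for illustration, that devices are drawn with probability proportional to their failure rate. The function names and the short component list (rates taken from Table 25, partitions from Section 11.1) are hypothetical inputs.

```python
import random

def select_faults(components, n_faults, rng):
    """components: list of (device, partition, failure_rate).
    Draw devices with probability proportional to failure rate (assumed rule)."""
    weights = [rate for _, _, rate in components]
    return rng.choices(components, weights=weights, k=n_faults)

def batch_by_partition(faults, group_size=35):
    """Split faults by partition, then into groups of at most 35 per emulator run."""
    by_partition = {}
    for fault in faults:
        by_partition.setdefault(fault[1], []).append(fault)
    return {p: [fs[i:i + group_size] for i in range(0, len(fs), group_size)]
            for p, fs in by_partition.items()}

components = [("9407", 1, 1.3431), ("54LS374", 2, 0.7234),
              ("2901A", 4, 2.1656), ("54S472", 5, 1.009)]
rng = random.Random(0)          # fixed seed so the run is repeatable
faults = select_faults(components, 200, rng)
batches = batch_by_partition(faults)
```

A complete preprocessor would also choose the faulted node within the device and the stuck-at polarity; those steps are omitted from this sketch.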

The emulation phase runs the BDX-930 emulator program repetitively until 
all groups of 35 faults have been processed. After each simulated run, the 
results are written out to disk for later processing. 

The postprocessor takes the results of the emulator runs and prints a
table. It first determines whether or not the fault was detected and, if so,
during which iteration or step of the test program detection occurred. The
exact location of the fault is determined, and all this information is
displayed for the analyst. Cumulative statistics are also computed.

The true-value emulator was subsequently verified by single stepping 
through a self-test program consisting of 2000 executable instructions. The 
self-test was designed to exercise every instruction type in the instruction 
repertoire of the BDX-930. This is essentially the same procedure used to 
validate the hardware version of the processor. 

11.4 Typical Circuit Representations 

Some typical representations of components are shown in Figures 19 thru 22. 
Each of these diagrams represents a single integrated circuit chip, which 
is coded in BLISS as a subroutine. In turn each partition consists of sub- 
routine calls that simulate a particular function of the CPU. 

11.5 Summary of Emulation Characteristics 

A summary of emulation characteristics is given in Table 28. 


TABLE 24 PARALLEL OPERATION OF THE BDX-930 PROCESSOR


TABLE 25 COMPONENTS OF THE BDX-930 CPU

DEVICE                      FAILURE RATE PER UNIT
                                   (PPMH)

9407                               1.3431
2901A                              2.1656
2902                               0.3898
5440                               0.0653
54125                              0.0855
54S00                              0.0855
54S04                              0.1003
54S10                              0.0764
54S20                              0.0654
54S32                              0.2138
54S288 (32x8 prom)                 0.1787
54S472 (512x8 proms)               1.009
54LS00                             0.084
54LS02                             0.084
54LS04
54LS08                             0.0983
54LS11                             0.0752
                                   0.084
54LS86                             0.084
54LS113                            0.1447
54LS151                            0.1483
54LS153                            0.1447
54LS158                            0.1410
54LS169                            0.6603
54LS175                            0.1703
54LS245                            0.3792
54LS253                            0.1447
                                   0.1636
54LS273                            0.6882
                                   0.2681
54LS352                            0.3117
54LS367                            0.1100
54LS374                            0.7234
54LS377                            0.7148


TABLE 26 MICROCIRCUITS AND EQUIVALENT GATE COUNT

DEVICE          EQUIVALENT GATES

2901A                 798
2902                   19
54113                   8
54151                  17
54153                  16
54158                  15
54169                  58
54175                  22
54245                  18
54253                  16
54273                  34
54352                  16
54374                  26
54377                  35
9407                  143

FIGURE 13. PROCESSOR ARCHITECTURE
(Block diagram. Labels: COMPUTER CONTROL UNIT; ARITHMETIC PROCESSING UNIT
(APU); PROGRAM CONTROL; CONTROL. Legend: FAULTED COMPONENT.)

TABLE 27 VALUE OF NODES

Columns: S-a-1 at 3, S-a-0 at 4, S-a-1 at 4, S-a-1 at 5, S-a-0 at 6, TRUE

1 1 1 1 1 1 1

0 0 0 0 0 0 0

1 1 1 1 1 1

0 0 1 0 0 0

Node definitions:

3 = 1 AND 1
4 = 2 AND 2
5 = 3 AND 4
6 = NOT 4

FIGURE 18a GATE-LEVEL TO REGISTER-LEVEL CONVERSION ALGORITHM
(Flowchart. Block labels: CLEAR, EXIT. NOTE: "+" = logical "OR".)

FIGURE 21 IC 153
(Circuit diagram. Label: STROBE.)

FIGURE 22 IC 158

TABLE 28 EMULATOR CHARACTERISTICS

GATE-LEVEL EMULATOR

• CODED IN BLISS

• BDX-930 CPU @ GATE LEVEL (A FORTIORI, AT COMPONENT-LEVEL)

• MICROPROGRAM MEMORY PROMS (7, 512x8)

• MICROPROGRAM CONTROL PROM (1, 32x8)

• MACROINSTRUCTION START ADDRESS PROM (1, 512x8)

• MAIN PROGRAM MEMORY (ROM) @ REGISTER-LEVEL

• SCRATCHPAD MEMORY (RAM) @ REGISTER-LEVEL

• 36 CPU'S EMULATED IN PARALLEL

• 36 SCRATCHPAD MEMORIES EMULATED IN PARALLEL

• 1 MAIN MEMORY EMULATED AND SHARED BY 36 COPIES

• 5100 GATES @ 4 NODES/GATE, AVERAGE

• 33,024 BITS OF PROM @ 1 NODE/BIT

• S-A-0, S-A-1 FAULTS @ EACH GATE NODE OR BIT

• NO FAULTS IN SCRATCHPAD MEMORIES

• NO FAULTS IN MAIN MEMORY

• 67,256 FAULTS POSSIBLE

• TRUE-VALUE REAL-TIME RATIO = 21,000:1

• REAL-TIME RATIO WITH FAULTS = 25,000:1
  (PDP-10, CMUD COMPUTER)
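
The PROM bit count in Table 28 follows directly from the PROM sizes listed in the same table, as the short check below shows. The dictionary layout and variable names are illustrative only.

```python
# (number of devices, words per device, bits per word), as listed in Table 28
proms = {
    "microprogram memory": (7, 512, 8),
    "microprogram control": (1, 32, 8),
    "macroinstruction start address": (1, 512, 8),
}

prom_bits = sum(count * words * width for count, words, width in proms.values())
microcode_width = proms["microprogram memory"][0] * proms["microprogram memory"][2]

print(prom_bits)        # 33024
print(microcode_width)  # 56
```

The seven 8-bit wide microprogram proms also account for the 56-bit microcode word mentioned in Section 11.1.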


12.0 EXTENSION OF EMULATOR TO MULTIPROCESSOR SYSTEMS 


One of the objectives of the present program was to study and make recom- 
mendations on how the emulation could be utilized to perform fault injection 
experiments on the SIFT (Software Implemented Fault Tolerance) computer system 
which was developed by SRI International with Bendix, Flight Systems Division, 
as a major subcontractor. 

12.1 Description of SIFT (see ref. 4)

SIFT is an ultra-reliable computer system that is designed for flight cri- 
tical aircraft control and avionics applications. It is based on a multipro- 
cessor architecture that achieves fault tolerance by replicating computing tasks 
among processing units. Error detection and system reconfiguration are per- 
formed by software. The SIFT system is shown in Figure 23 in a 7-processor 
configuration. A single processor is shown in Figure 24. 

Initially, each SIFT processor is assigned a set of software tasks. If a 
task is critical it will be redundantly executed by either three or five pro- 
cessors, depending upon the criticality of the task. Each processor executing 
a critical task inputs sensor data over a dedicated 1553A bus. This data is 
stored in memory and transmitted to the other processors over a high speed, 
serial, intercomputer data link which operates in a broadcast mode. By means 
of a selection algorithm each processor of the redundant set selects the same 
inputs, computes its assigned task and transmits the results to the other pro- 
cessors over the intercomputer data link. 

The results of each computation are compared in the local processors and 
any discrepancies are noted. When a faulted processor has been identified the 
processor is, thereafter, "ignored" by the other processors. The critical tasks
are then redistributed among the remaining processors.

Fault isolation and reconfiguration are the functions of the Global Execu- 
tive task which, because of its criticality, is also redundantly computed. The 
Global Executive reads the error reports from the local processors and attempts 
to identify the faulted processor. When the processor is identified the Global 
Executive computes a new distribution of tasks and informs the remaining pro- 
cessors of their new assignments. 

12.2 SIFT Emulator 

The above description of SIFT was given to orient the reader in the SIFT 
philosophy and to note that the fault isolation and reconfiguration tasks may be 
performed by processors other than those computing the application tasks. As
we shall see presently, implementing this feature will result in an increase in 
run-time of the emulator. 

In order to emulate the SIFT system it is necessary to extend the present
emulation to include the I/O interface hardware shown in Figure 24, e.g.,

• Transaction and data files

• 1553A controller

• Broadcast sequencer

• Receiver sequencer

• Broadcast bus

The parallel mode of operation of the present emulator automatically
accommodates a multiprocessor system. Instead of using 36 bits to represent 36
versions of the same processor, the bits are subdivided into sets of 7, 7, 7,
7, 7 and 1, with each 7-bit set representing a single version of the SIFT
system. The last bit is extraneous. Faults would then be injected into one of
the 7 bits of each segment and the emulation would be run, exactly as in the
present study. If faults are limited to a single processor and its program
memory then it is only required to emulate the 6 memories of the non-faulted
processors and the 5 memories of the faulted processors.

The Preprocessor and Postprocessor programs would have to be modified to
reflect the new rules of fault injection and fault identification.

In the SIFT emulation only 5 faults can be emulated in a single run, as
compared with 35 in the present study. This reduction is the result of
emulating the processors in groups of 7. An apparently attractive alternative
approach would be to emulate the 6 non-faulted processors and 30 faulted
processors in a single run. The problem here is that the action of both the
local and global processors may be different for different faults. As a
consequence, it is necessary to emulate the entire SIFT system for each fault.
The effect of emulating 5 instead of 35 faults in a single run is a 7-fold
increase in run-time of the emulator.
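
The bit allocation and the resulting run-time penalty can be worked through in a short sketch. The placement of the 7-bit groups (contiguous, starting at bit 0) and the helper names are assumptions for illustration.

```python
WORD_BITS = 36
PROCESSORS_PER_SYSTEM = 7
FAULTS_PER_RUN_SINGLE_CPU = 35      # present study: 35 faulted machines + 1 good

# Five complete SIFT systems fit in one host word, with one bit left over.
systems, spare_bits = divmod(WORD_BITS, PROCESSORS_PER_SYSTEM)

def system_mask(k):
    """Bits occupied by SIFT system k (contiguous placement is assumed)."""
    return ((1 << PROCESSORS_PER_SYSTEM) - 1) << (PROCESSORS_PER_SYSTEM * k)

def fault_bit(k, processor):
    """Bit of the single faulted processor within system k."""
    return 1 << (PROCESSORS_PER_SYSTEM * k + processor)

masks = [system_mask(k) for k in range(systems)]
slowdown = FAULTS_PER_RUN_SINGLE_CPU / systems   # one fault per emulated system
```

With one fault per emulated system, five faults are covered per run instead of 35, which gives the 7-fold increase in run-time quoted above.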


12.3 SIFT Fault Injection Experiments 

Having established the essential features of the SIFT emulator we now con- 
sider some possible applications. 

Experiment #1 

Inject a fault and determine the time required to detect and isolate the 
fault and reconfigure the system. The data will consist of 3 latency distribu- 
tions for time of detection, isolation and reconfiguration. 

Experiment #2 


From Experiment #1 identify those faults which remain latent after several 
repetitions of the application task. These faults are likely to remain latent 
for long periods of time, making the detection or isolation of subsequent faults 
more difficult. 
The experiment consists of injecting a latent fault in one processor and a
random fault in another processor of the same redundant set and observing the
time to detect, isolate and reconfigure.

Experiment #3 

The results of Experiment #1 establish the time frame for detection,
isolation and reconfiguration. In this experiment the effect of a second fault
in this time frame will be observed. The first fault is injected as in
Experiment #1. A second fault is injected in a different processor of the
redundant set at a randomly selected point in time within the time frame for
reconfiguration. The subsequent detection, isolation and reconfiguration will
be observed.

Each of these experiments would require a small modification to the Pre- 
processor program since they differ in the way faults are injected. In all 
experiments the emulator remains unchanged. 

13.0 CONCLUSIONS


On the basis of the study we conclude:

• Emulation is a practicable approach to failure modes and effects analysis
of a digital processor.

• The run time of the emulated processor on a PDP-10 host computer is only
20,000 to 25,000 times slower than the actual processor. As a consequence
large numbers of faults can be studied at relatively little cost and in a
timely manner.

• The fault model, although somewhat arbitrary, can be updated as more data
become available.

• Gate-level equivalent circuits are available for digital devices including
the 2901.

• Gate-level faults are more difficult to detect than component-level faults.

• A computer self-test program of the order of 2000 executable instructions
can detect 98% and possibly 99 or 100% of component-level faults. The
feasibility of detecting the same proportions of gate-level faults remains
to be determined.

• Emulation can be an important tool in the design of an efficient self-test.

• In a comparison-monitored system the accumulation of latent faults can be
significant. In the study the proportion of undetected faults after 8
repetitions ranged from 40 to 62%.

• For the range of values considered the proportion of undetected faults
after 8 repetitions is a linear function of the number of executable
instructions.

• With a suitable choice of parameters the Urn Model can be used to describe
fault latency in a comparison-monitored system. However, the proposed
alternate model should be investigated.

• Faults in the micromemory are difficult to detect.

• In a comparison-monitored system most detected faults are detected in the
first repetition of the program. Subsequent repetitions do not appreciably
increase the proportion of detected faults.

• A gate-level emulation of a real processor may contain a large proportion
of indistinguishable faults. Identifying such faults is difficult.

• Only 48% of all detected faults were detected by an explicit subtest of
Self-Test; 52% were detected because the fault resulted in a wild branch.

• The results of the present study with regard to latent faults are in
remarkably close agreement with the results of (ref. 1). The similarity is
even more surprising when one considers that (ref. 1) employed a very simple,
idealized processor with only 13 instructions and equally distributed
faults. A comparison of the latency distributions for FETSTO, FIB and
ADDSUB is given in Table 29. Based on this similarity it may be conjectured
that the results of the present study can be extrapolated to other
processors of comparable complexity.

TABLE 29 COMPARISON OF LATENCY ESTIMATES

                    FETSTO DETECTED       FIB DETECTED       ADDSUB DETECTED
REPETITION         REF (1)   PHASE I    REF (1)   PHASE I    REF (1)   PHASE I

1                   .187      .3         .261      .35        .313      .335
2                   .051      .048       .057      .055       .096      .027
3                   .017      .021       .047      .007       .067      .03
4                   .017      0          .009      .002       .024      .003
5                   .017      0          .028      .002       .014      .005
6                   .042      .006       .033      .002       .014      .003
7                   0         .002       .019      0          .019      .002
8                   .025      .007       .009      .002       .004      0
9 (undetected)      .644      .617       .537      .582       .449      .595
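
The entries of Table 29 can be cross-checked with a few lines of arithmetic. The sketch below transcribes the table and computes, for each column, the column total and the share of detected faults that were detected in the first repetition; the variable names are illustrative.

```python
# Columns of Table 29: repetitions 1-8 followed by the undetected proportion.
TABLE_29 = {
    ("FETSTO", "REF (1)"): [.187, .051, .017, .017, .017, .042, 0, .025, .644],
    ("FETSTO", "PHASE I"): [.3, .048, .021, 0, 0, .006, .002, .007, .617],
    ("FIB", "REF (1)"):    [.261, .057, .047, .009, .028, .033, .019, .009, .537],
    ("FIB", "PHASE I"):    [.35, .055, .007, .002, .002, .002, 0, .002, .582],
    ("ADDSUB", "REF (1)"): [.313, .096, .067, .024, .014, .014, .019, .004, .449],
    ("ADDSUB", "PHASE I"): [.335, .027, .03, .003, .005, .003, .002, 0, .595],
}

totals = {key: sum(col) for key, col in TABLE_29.items()}
first_share = {key: col[0] / sum(col[:8]) for key, col in TABLE_29.items()}

for key in TABLE_29:
    print(key, round(totals[key], 3), round(first_share[key], 2))
```

Each column sums to one within rounding, and in every column more than half of the detected faults are detected in the first repetition, which is consistent with the conclusion stated above for comparison-monitored systems.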


14.0 RECOMMENDATIONS FOR FUTURE STUDIES 


• The Phase I experiments should be repeated using flight critical, flight
control computations. The instruction set should not be limited as it
was in the present study. Additional tasks would include

    • Determination of the proportion of faults that affect the control
    surfaces.

    • Determination of the proportion of faults that prevent failure detection
    in the faulted processor.

• Investigate other methods of fault detection such as the use of redundant
computations in a non-redundant processor in a flight critical, flight
control application.

• Investigate the feasibility of extending the emulation to I/O interface
devices such as AD and DA converters, I/O controllers, etc.

• Generate more realistic fault models. Perhaps manufacturers could be
prevailed upon to supply equivalent circuits that are more closely correlated
with failure modes as well as with performance.

• Develop a more realistic Urn Model. The resultant model could be an
important tool in reliability modelling of a redundant system.